Using Item Response Theory to Evaluate Self-directed Learning Readiness Scale

Item response theory (IRT) has become one of the most popular approaches to instrument development and evaluation. This baseline study applied the graded response model (GRM) to a 40-item self-directed learning readiness (SDLR) scale, using data from 648 randomly selected female undergraduate psychology students at Qassim University in Saudi Arabia, to evaluate the scale at the item and scale levels. The results provide detailed diagnostic information that can be used to refine the scale. The GRM analysis detected two locally dependent items, one item with a low discrimination parameter, and 15 items that misfit the model. The scale tends to measure low and moderate levels of SDLR. Further psychometric evaluation is recommended, and the SDLR scale should be reviewed on the basis of both quantitative and qualitative analysis.


Introduction
Self-directed learning (SDL), which is becoming one of the most important methods in adult education, is defined as the process in which learners take the initiative, with or without others' assistance, to diagnose their learning needs, formulate learning objectives, identify resources for learning, select suitable learning strategies, and evaluate the extent to which the learning objectives are met (Knowles, 1975). When the concept of self-directed learning readiness (SDLR) first appeared, it was defined as the degree to which the learner possesses the attitudes, abilities, and personality characteristics needed for SDL (Wiley, 1983).
Many researchers agree on the assumptions built into SDLR: adults are inherently inclined to be self-directed; self-direction can be improved; and the capability for self-direction can be generalized to many other learning situations (Candy, 1991; Fisher, King, & Tague, 2001; Guglielmino, 1989). Researchers have also examined the relationship between SDLR and many educational and psychological variables, finding that SDLR could play a major role in students' behavior (Wiley, 1983; O'Kell, 1988; Dyck, 1986). Relatively few scales have been developed to measure students' SDLR within the educational context. An exception is the scale developed by Fisher, King, and Tague (2001), recently translated into Arabic by Al Hassoun (2017), which might be considered the most popular scale developed in recent years. Fisher et al. (2001) used their scale in a study of undergraduate nursing students but did not include any items relating to a specific major, which makes it applicable to any academic field.
Yet the SDLR scale has not been sufficiently examined psychometrically, and few studies have investigated its factor structure. Most researchers who have investigated the psychometric properties of the SDLR scale used classical test theory (CTT), thus confining their investigations to the scale's reliability and validity indices. Fisher et al. (2001) investigated the psychometric properties of their SDLR scale in a study of 201 undergraduate nursing students. During the scale's development, they examined its structural validity using principal component analysis and item-total correlation coefficients. The results showed, first, that the scale is unidimensional with three main factors: self-management (SM), learning desire (LD), and self-control (SC); and, second, that the reliability indices for each subscale were acceptable, which indicated that the scale was valid and consistent. Other work found that, rather than consisting of three factors, the scale consisted of four: critical self-evaluation, learning self-efficacy, self-determination, and effective organization for learning. Fisher and King (2010) followed with a study that evaluated their scale using confirmatory factor analysis (CFA) on a sample of 227 undergraduate nursing students. Their investigation used three one-factor congeneric models and showed that the best model fit was obtained after the elimination of 11 items (SM1, SM9, SM12, LD7, LD8, LD9, SC1, SC2, SC8, SC10, and SC15) because of their low loadings.
Torabi et al. (2013) examined Fisher et al.'s (2001) SDLR scale among primary school teachers in Esfahan. The results showed that the scale fit approximately under CFA, and the authors concluded that it could be used to evaluate SDLR among teachers. Williams and Brown (2013) investigated the structure of Fisher et al.'s (2001) SDLR scale using several CFA models on a sample of 233 Australian undergraduate students. The results showed that the 40-item, three-factor model did not fit the data, while the 36-item, four-factor model fit better than the 29-item, three-factor model.
Although the concept of SDL readiness is receiving growing attention, the research described above used only CTT and factor-analytic methods to verify the SDLR scale. Given scientific progress and the large number of available research instruments, evaluating those instruments is essential, especially for frequently used ones such as Fisher et al.'s SDLR scale. Because this scale has yet to be investigated using advanced psychometric methods, the current study aims to evaluate it using item response theory (IRT), one of the most important approaches to evaluating and developing scales because of its accuracy at the item and person levels.

Participants
The present study involved 636 female undergraduate psychology students, with an average age of 19.91 years (SD = 0.99), in the Faculty of Education at Qassim University in Saudi Arabia during the 2016/2017 academic year; participation was voluntary.

Self Directed Learning Readiness (SDLR)
The participants completed the 40-item SDLR scale developed by Fisher et al. (2001) and translated into Arabic by Al Hassoun (2017). The scale contains three parts designed to estimate three dimensions of SDLR: self-management, learning desire, and self-control. These parts consist of 13, 12, and 15 items, respectively.

Research Procedure
The SDLR scale was administered to the study participants; each participant required approximately 25 minutes to complete it.

Statistical Procedure
The SDLR scale has a graded polytomous response format, so Samejima's graded response model (GRM) is considered the most appropriate IRT model to use. The GRM was designed for ordinal polytomous items and is considered a generalization of the two-parameter model. It expresses the behavior of an item through its discrimination parameter ($a$) and a set of threshold parameters ($b_1, \ldots, b_{m-1}$) located between the adjacent categories of a polytomous item ($k = 1, \ldots, m$) (Attorresi, Abal, Galibert, Lozzia, & Aguerry, 2011; Samejima, 1969).
The GRM, like all IRT models, derives the probability of each item response as a function of the latent trait and the item parameters. Specifically, it models the cumulative probability of responding in a given category or above. The resulting probability of each category can be plotted, which produces category response curves (CRC) (DeMars, 2010). Baker (2001) pointed out that the GRM, like most IRT models, relies on two main assumptions:
1) Unidimensionality: all items belong to a common construct that influences the item responses.
2) Local independence: conditional on the latent trait, all elements of a respondent's item response vector are independent of each other.
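The way the GRM turns cumulative probabilities into category probabilities can be illustrated with a short sketch. This is not the estimation routine used in the study; it only evaluates the model for given parameters, assuming the usual logistic form of the boundary curves. The function name `grm_category_probs` and the parameter values in the example are hypothetical.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under Samejima's graded response model.

    theta : latent trait value(s), scalar or 1-D sequence
    a     : item discrimination (slope)
    b     : ordered thresholds, length m - 1 for an item with m categories
    Returns an array of shape (number of theta values, m).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Boundary curves: probability of responding in category k or above
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # The lowest boundary is certain (1) and the one past the top is impossible (0)
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    cum = np.hstack([ones, p_star, zeros])
    # Each category probability is the difference of adjacent boundary curves
    return cum[:, :-1] - cum[:, 1:]

# Hypothetical five-category item with illustrative (not estimated) parameters
probs = grm_category_probs(theta=[-2.0, 0.0, 2.0], a=1.2, b=[-2.0, -1.0, 0.0, 1.5])
print(probs.round(3))
```

Each row of `probs` gives the five category probabilities at one trait level and sums to one; plotting the columns against a grid of trait values would give the category response curves.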

Basic Statistics
Total scores for the SDLR scale and its dimensions were computed, and the means and standard deviations were then extracted, as shown in Table 1. The results showed that the sample performed best on self-control. This result may be attributed to how Muslims raise their children, with an emphasis on commitment, avoidance of rage, respect toward others, avoidance of taboos, and other ethics prescribed by Islam. In other words, this result may reflect the influence of Islamic rules that govern Muslims' internal and external behavior and thereby foster high self-control. The means and standard deviations of each item were also calculated, as shown in Table 2, to further clarify the participants' performance on the SDLR scale.
Note. SMi: self-management items; LDi: learning desire items; SCi: self-control items.
As shown in Table 2, item SC3 (I am responsible for my own decisions) received the highest mean value, and its responses varied less across respondents. This could be explained by the respondents' age: they were adolescents, who tend toward independent decision making.

Reliability
The alpha coefficient was computed for each dimension score and for the overall scale, as shown in Table 3. The results showed that the SDLR scale has moderate reliability. The learning desire dimension had the lowest alpha value, which could be explained by the number of items in this dimension.
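For readers who want to reproduce this kind of reliability index, the following is a minimal sketch of coefficient alpha computed from a respondents-by-items score matrix. The function name `cronbach_alpha` and the toy data are illustrative assumptions; they are not the study's data or software.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a (respondents x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # sample variance of the sum score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Toy data: 5 hypothetical respondents answering 3 five-point items
toy = np.array([[4, 5, 4],
                [3, 3, 2],
                [5, 4, 5],
                [2, 2, 3],
                [4, 4, 4]])
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

Because the formula depends on the number of items `k`, a dimension with fewer items tends to obtain a lower alpha when the inter-item correlations are similar, which is consistent with the explanation offered for the learning desire dimension.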

Correlations
The correlations between the scale dimensions and the total score were computed, as shown in Table 4. The correlations among the three SDLR dimensions were moderate and significant (p < 0.01), which indicates that each dimension represents a component related to the others. Each dimension was also highly and significantly correlated with the SDLR total score, which supports the scale's consistency.

Checking Model Assumptions
Unidimensionality, which is a primary assumption for GRM, was checked using both EFA and CFA.EFA yielded a general factor with 22.85% initial variance, and the ratio of initial variance for the first and second factors was 4.28.As Reckase (1979) suggested, if the first factor has a variance that exceeds 20%, and if the ratio of initial variance for the first and second factors exceeds 2, these results confirm the scale's unidimensionality.
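The two quantities used in this check can be sketched from the eigenvalues of the item correlation matrix. The sketch below assumes that "initial variance" refers to the eigenvalues of the Pearson correlation matrix, which is a common convention but is not confirmed by the text; the simulated data and the function name `first_factor_summary` are hypothetical.

```python
import numpy as np

def first_factor_summary(responses):
    """Percent of variance of the first eigenvalue and the first/second ratio."""
    responses = np.asarray(responses, dtype=float)
    corr = np.corrcoef(responses, rowvar=False)        # item-by-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # eigenvalues, largest first
    pct_first = 100.0 * eigvals[0] / eigvals.sum()
    ratio = eigvals[0] / eigvals[1]
    return pct_first, ratio

# Simulated one-factor data: 500 respondents, 10 items sharing one latent trait
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
items = trait + rng.normal(size=(500, 10))

pct_first, ratio = first_factor_summary(items)
# Criteria described in the text: first factor above 20% and ratio above 2
print(pct_first > 20, ratio > 2)
```

With real Likert-type responses, a polychoric correlation matrix might be preferred over Pearson correlations; that choice is outside the scope of this sketch.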
CFA was also conducted for the 40 SDLR items using a model with one general factor and three subscales. The goodness-of-fit (GOF) indices were computed: goodness-of-fit index (GFI) = 0.90, comparative fit index (CFI) = 0.90, incremental fit index (IFI) = 0.90, and root mean square error of approximation (RMSEA) = 0.037. Because all GOF indices had acceptable values, the one-factor model was judged a good fit to the data, which is the second piece of evidence for unidimensionality.
Local independence was examined by checking the discrimination parameter ($a$). Items with a high slope parameter (e.g., $a > 4$) point to potential violations. The local dependence chi-square values (LD $\chi^2$) were also used as an additional assessment of local independence. An item pair with LD $\chi^2 > 10$ suggests a serious violation of item independence (Nguyen, Han, Kim, & Chan, 2015). The results showed that all slope parameters were less than 4, and the LD $\chi^2$ values for item pairs showed a significant violation of independence for two items: LD1 (I want to learn new information) and LD2 (I enjoy learning new information), which seem to have nearly the same meaning.

IRT Analyses of the SDLR Scale
The SDLR items were analyzed using the GRM, one of the IRT models for polytomous items, with parameters estimated by the Bock-Aitkin method implemented in the IRTPRO 4 software (Cai, Thissen, & du Toit, 2011).
Item parameters for the SDLR scale under the GRM are shown in Table 5, where the $b$'s represent the ability levels at which the probability of responding above the corresponding threshold is 0.5. The slope parameter ($a$) represents the power of each item to discriminate between respondents with high and low ability scores.

Threshold Parameter
Because the SDLR scale items are polytomous with five possible responses, each item has four response threshold parameters: $b_1$, $b_2$, $b_3$, and $b_4$. According to the GRM, these thresholds mark the trait level at which there is a 50% chance of scoring at or above a given response category, and they convey important information about items (Steinberg & Thissen, 1995).
Because the SDLR scale is self-reported and measures how respondents feel about themselves and their readiness for self-directed learning, it is desirable that the scale items provide good information across a wide range of the latent trait.
Table 5 shows the threshold parameters for the SDLR scale. Each item covered a certain range of the latent continuum: some items had relatively low threshold parameters (e.g., LD4: I enjoy a challenge; LD10: I learn from my mistakes), while others had relatively high threshold parameters (e.g., LD5: I enjoy studying; SM4: I set strict time frames). The results in Table 5 showed that the item thresholds were consistently ordered and that each item covered an acceptable range of the latent trait, mostly at moderate and low trait levels, with some exceptions for items SM1, SM4, LD5, LD6, and LD12. Item SC3 (I am responsible for my own decisions/actions), however, covered only a limited, low range of the latent trait, with all thresholds below -1, which could be explained by its generality (it is not specific to the self-directed learning context). The rest of the SDLR scale tended to sit at the low-to-middle level of the latent trait continuum, as most of the $b$ parameters had negative values, that is, values below the midpoint of the continuum.

Discrimination (Slope) Parameters
The results in Table 5 show that item slopes range from 0.49 to 1.69. Items LD4 (I enjoy a challenge), LD11 (I need to know why), and LD12 (When presented with a problem I can't resolve, I will ask for assistance) had low discrimination parameters, which could be explained by their content: they are general statements not necessarily related to self-directed learning ability. Items SC1 (I prefer to set my own goals), SC6 (I prefer to set my own learning goals), SC10 (I have high personal expectations), SC14 (I have high belief in my abilities), and SC15 (I prefer to set my own criteria on which to evaluate my performance) had high discrimination parameters, which could be explained by their content, which is specific and clearly related to self-directed learning readiness. The remaining items had acceptable, moderate discrimination parameters according to Baker (2001).
Category response functions (CRF) were also extracted, as shown in Figure 1. The CRFs represent the response probabilities as a function of the latent trait; they describe the probability that a respondent will answer an item with 1, 2, 3, 4, or 5. As shown in Figure 1, some items, such as LD4, had low peaks and high overlap between category functions, which is consistent with their low discrimination parameters. Others, such as SC6, had high peaks and less overlap between category functions, which is consistent with their high discrimination parameters.
The information provided by each item and by the whole scale was computed, as shown in Table 6. Item and scale information functions were also plotted for the SDLR scale across trait levels, as shown in Figure 2. Most of the SDLR scale items provided relatively high information at low levels of the trait. A few items, such as SM4 and LD5, provided good information along the ability continuum, while other items, such as LD11 and LD12, were quite weak and provided little information along the continuum. This result was associated with the low discrimination power of these two items, which made them unable to provide good information across the ability continuum.
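The link between discrimination and item information can be made concrete with a small sketch of the GRM item information function, computed as the sum over categories of the squared derivative of each category probability divided by that probability. The two slopes below are the extremes of the range reported above; the thresholds and the function name `grm_item_information` are hypothetical and chosen only for illustration.

```python
import numpy as np

def grm_item_information(theta, a, b):
    """Fisher information of one graded-response item at each theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Boundary curves, padded with 1 below the first and 0 above the last
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((theta.size, 1)), p_star, np.zeros((theta.size, 1))])
    probs = cum[:, :-1] - cum[:, 1:]      # category probabilities
    w = cum * (1.0 - cum)                 # equals zero at the padded boundaries
    dprobs = a * (w[:, :-1] - w[:, 1:])   # derivative of each category probability
    return (dprobs ** 2 / probs).sum(axis=1)

grid = np.linspace(-4.0, 4.0, 81)
thresholds = [-2.0, -1.0, 0.0, 1.0]       # hypothetical, shared by both items
info_low = grm_item_information(grid, a=0.49, b=thresholds)
info_high = grm_item_information(grid, a=1.69, b=thresholds)
print(info_low.max() < info_high.max())
```

With identical thresholds, the higher-slope item yields a much taller information curve, which mirrors the pattern described for the weakly discriminating items LD11 and LD12.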

Test Information Function
The test information function (TIF) is simply the sum of the item information at each ability level (Baker, 2001).
It is a useful tool for judging the quality of a test as a measure of a particular latent trait. Figure 3 shows the test information function for the SDLR scale. As shown in Figure 3, the SDLR scale provides high information, and therefore measures reliably, at low and moderate levels of the trait. Measurement precision decreases for ability levels above 1.2. The marginal reliability index was 0.93, which indicates that the SDLR scale is reliable.
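The relationship between test information, measurement error, and reliability that underlies this interpretation can be written compactly. The expressions below are the standard approximations, assuming the latent trait is scaled to unit variance; $n$ denotes the number of items and $I_i(\theta)$ the information of item $i$, symbols introduced here for illustration.

```latex
I(\theta) = \sum_{i=1}^{n} I_i(\theta), \qquad
SE(\hat{\theta}) \approx \frac{1}{\sqrt{I(\theta)}}, \qquad
r(\theta) \approx 1 - \frac{1}{I(\theta)} \quad \text{(assuming } \operatorname{Var}(\theta) = 1\text{)}
```

Under these approximations, a conditional reliability of 0.90 corresponds to test information of about 10 and a standard error of about 0.32. The marginal reliability reported in the text averages over the trait distribution, so it is not tied to a single trait level.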
Figure 3. Test Information Curve for SDLR scale

Item and Model Fit
Item fit was assessed using the generalized item fit statistic $S$-$\chi^2$, which tends to control the Type I error rate (Kang & Chen, 2011). Item fit indices were computed, and items with a probability less than 0.05 were considered misfitting. The results showed that the GRM fit the data better than the partial credit model; 15 of the 40 items did not fit the GRM (SM2, SM3, SM4, SM10, SM13, LD1, LD3, LD6, LD9, LD11, LD12, SC2, SC3, SC5, and SC13).
The overall goodness of fit of the GRM was examined once for the 40-item scale and again for the 24-item scale obtained after eliminating misfitting and poorly discriminating items, using the $M_2$ statistic (Maydeu-Olivares & Joe, 2006). The GRM fit statistics for the 40-item SDLR scale were $M_2$ = 17937.95, p = 0.0001, RMSEA = 0.21, while the values for the 24-item SDLR scale were $M_2$ = 114.30, p = 0.0001, RMSEA = 0.00. This means that the 16 eliminated items were responsible for significant violations of the GRM. Marginal reliability was 0.93 for the 40-item SDLR scale and 0.90 for the 24-item SDLR scale, which means that the scale was still reliable after eliminating 16 items.

Discussion
The main goal of the current study was to evaluate the psychometric properties of the SDLR scale and its suitability as a measure of SDLR using item response theory models, whereas most researchers have examined its properties using CTT techniques and factor analysis.
CTT is a traditional approach to improving measurement. The theory is mainly concerned with observed scores, true scores, and error scores. It rests on relatively simple assumptions, but its results are sample specific, which limits its usefulness (Budgell, Raju, & Quartetti, 1995; Hambleton, Swaminathan, & Rogers, 1991; Hulin, Drasgow, & Parsons, 1983). One of IRT's most important properties is that item characteristics are independent of the characteristics of the persons tested (Baker, 2001). Recently, attention has been devoted to IRT models capable of analyzing rating data on ordinal, categorical, or nominal scales, which makes IRT applicable to any type of psychological assessment instrument. Given these advantages, the current study analyzed the SDLR scale data using Samejima's graded response model.
The discrimination parameters of the individual items showed that some items, such as SC1, SC6, and SC15, discriminate between respondents better than others; the rest of the scale items discriminate well, with some variation across items. This result is close to the results obtained by Fisher et al. (2001), in which these three items had relatively high corrected item-total correlations, which are considered an indicator of item discrimination (Cappelleri, Lundy, & Hays, 2015).
The item threshold parameters showed that most of the scale items tend to measure moderate and low levels of self-directed learning, except items SM4, LD5, and LD6, which were suitable for measuring high levels of self-directed learning.
The item fit statistics showed that only 23 items of the SDLR scale fitted the model well. For the overall goodness of fit of the GRM, the $M_2$ statistic showed that these 23 items fit the model well, while the full 40-item scale had an unacceptable value of $M_2$, which means that the violations were significant.
The results of the current study revealed important information about the SDLR scale that had not previously been identified. In general, the SDLR scale could be considered a good scale with some features that require improvement. The evaluated 23-item version could be considered a brief SDLR scale that is valid and accurate for identifying moderate and low levels of self-directed learning readiness. One recommendation is to add items with high discrimination parameters and high threshold values to the brief SDLR scale in order to measure high levels of self-directed learning readiness.

Conclusion
The current study provides evidence of the usefulness of IRT analysis for obtaining important additional psychometric information, which can improve the quality of instruments.

Figure 2. Item Information Functions for SDLR scale

Table 1. Basic statistics for the scale dimensions

Table 3. Alpha coefficient values for SDLR scale and its dimensions

Table 4. The correlations between SDLR scale dimensions and total score

Table 5. Item parameters for SDLR scale

Table 6. SDLR scale items information