Factor Structure and Reliability of Test Items for Saudi Teacher Licence Assessment

The Saudi National Assessment Centre administers the Computer Science Teacher Test for teacher certification. The aim of this study is to explore gender differences in candidates’ scores, and investigate dimensionality, reliability, and differential item functioning using confirmatory factor analysis and item response theory. The confirmatory factor analysis results for 6 371 examinees’ scores of 66 multiple-choice items when grouped into three content domains showed that the test data were unidimensional (ability, trait). The domains were highly correlated (0.883 to 0.949) within this dimension. Data reliability estimated through latent variable modelling was acceptable at 0.848. Gender results for DIF signalled 13 items, five cases against males and eight cases against females; a finding of some balance in DIF direction against males and females. The study results confirm the validity of the Computer Science Teacher Test and support further refinement of multiple forms of the test.


Introduction
Teacher assessment is used for measuring and supporting pre-teacher education outcomes and teachers' professional development. In a review, DeLuca and Bellara (2013) found a multitude of teacher assessment standards used by national educational authorities with numerous assessment literacy measures. Further, the authors noted shifts in teacher education curricular concepts together with evolution in the national measures for student outcomes. In another review, Blömeke and Delaney (2014) noted that whilst teacher assessment studies from North America and other English-speaking countries focussed on internal assessment systems and practices and contained some cross-country comparisons, a trend to cultural comparison of teacher assessment systems had not yet emerged. This paper first introduces the Saudi educational environment, and this is followed by a short literature review. The methodology and results are presented and discussed, and conclusions drawn.

Saudi Education System
The Ministry of Education is the sole authority for education in Saudi Arabia, providing a free education for all Saudi students through to higher education. The Ministry also oversees a small educational private sector, generally for expatriates. Saudi schools are gender segregated, thus there is a significant number of men in the profession. The World Bank (2016) reported that in 2014 there were 761 737 trainees and teachers, pre-school to secondary school, of whom 52 per cent were women.
Teachers in Saudi Arabia are viewed in two ways. Because of their early association with mosques they are admired, although the secular education system set up after the 1932 declaration of the Kingdom foundered for decades due to its concepts of over-worked and underpaid teachers (Al-Rasheed, 2010). From the 1950s the corporation Saudi Aramco (2016) assisted the government in establishing schools to alleviate issues with illiteracy, eventually building 139 schools. Initially, boys only were permitted an education due to traditional beliefs in the conservative society, but by 1960 the first primary school for girls was opened with one student (Bowen, 2015).
As oil revenues became available, Saudi Arabia was in a better position to plan for the future, and in 1970 set in place the first of its five-year economic plans. Education was a priority, both for literacy for the population and to provide the nascent public sector with Saudis to replace the largely foreign workforce (Alshahrani & Alsadiq, 2014). However, the population growth in the late 20th century surpassed the Ministry of Education's ability to provide all Saudis with a quality education, and by the 6th economic development plan (1995-1999) a concentrated effort was made to improve the 'Saudisation' of the country's workforce, that is, replacing skilled expatriates with skilled Saudis. This emphasis on education continues today (Ahmed, 2016).

Teacher Education
Teaching in 20th century Saudi schools was criticised as being conservative and didactic (Norton & Syed, 2003). Teachers' education was expected to be at bachelor degree standard, but due to the pressure of population growth, this was not enforced and diplomates were accepted. Pedagogical practices were didactic: teachers explained principles of the curriculum but did not engage the students, who were thus passive learners, recording their lessons and memorising for examinations. A report for the Ministry of Education recommended, inter alia, improved teacher education, and in 2004 the Ministry embarked on a decade-long plan (Tatweer) to improve the quality of education in the Kingdom (International Bureau of Education, 2011).
As part of Tatweer's emphasis on teacher education, competencies were prepared for pedagogical, numeracy and literacy skills; however, these were not adequately administered and did not achieve the standards expected (Alzaydi, 2011; Alsharif, 2011; Al Shannag, Tairab, Dodeen, & Abdel-Fattah, 2013). Elyas and Pickard (2013) stated that teacher outcomes were challenged by variables in students' backgrounds, the rise of educational technology, and universities' hierarchies. Shortfalls in teacher competencies had external effects for the Ministry of Education. Comparing Saudi and Singapore results for grade 8 students from a 2007 international study (Trends in Mathematics and Science Study), Al Shannag et al. (2013) found that the Saudi teachers retained their teacher-centric style, whilst the more successful Singaporean teachers practised a student-centric educational system.
The Ministry of Education responded to these reports by implementing a change in focus from teacher to student. The National Centre for Assessment in 2010 developed a new teacher assessment framework, the National Professional Teacher Standards. The framework comprises 12 standards in two groups. The first group is pedagogical: professional knowledge, promoting learning, supporting learning, and professional responsibility (Al-Saud & Al-Sadaawi, 2014). The second group comprises the subject-specific teaching standards for 25 curricular courses. The standards guide teacher licensing examinations, identify training needs for new teachers, and set the quality of teaching programs.
As an example, one of the courses is the Computer Science Teacher Test (CSTT) for secondary school. It consists of three domains: computer and math, engineering and science; computer applications; and computer and education. Based on the 2010 standards, the test has been administered to 20 028 candidates, of whom 37 per cent were female (Ministry of Education, 2016).
This study investigates the validity of the test data by examining their dimensionality and key features such as reliability, and differential item functioning on gender in the framework of item response theory.

Literature Review
Confirmatory factor analysis seeks relationships between measurement data, that is, test results or indicators, and is used to identify latent variables (factors) (Brown, 2015). Unlike exploratory factor analysis, confirmatory factor analysis is hypothesis-based, thus all aspects of the model are pre-specified. This form of analysis is used to 'verify the number of underlying dimensions of the instrument (factors) and the pattern of item-factor relationships (factor loadings)' (Brown, 2015, p.1). Netemeyer et al. (2003) stated that confirmatory factor analysis can be used to assess dimensionality (fit, correlated measurement errors, degree of cross-loading).
In designing tests and measures which produce large data such as the Computer Science Teacher Test, dimensionality refers to the homogeneity of items and sub-items. Netemeyer, Bearden, and Sharma (2003) explained that a unidimensional measure indicates a single latent variable that accounts for item data (responses), whereas a multidimensional measure has more than one latent variable among the data. In designing such tests, a unidimensional internal structure is a step towards establishing reliability (consistency between items) and validity (consistency between the measure's constructs). Whilst unidimensionality is used in confirmatory factor analysis, it is also a fundamental assumption in item response theory (Deng, Wells, & Hamilton, 2008).
In longitudinal research, the analysis of measurement invariance of latent constructs is important as scores may vary over time. For example, in education, repetitive examination of cohorts of students determines the progress of individuals over the course of their education or is used to compare group scores. Measurement invariance was predated by Jöreskog's (1971, p.409) observation of 'similarities and differences in factor structures between different groups'. Jöreskog posited that parameters in factor analysis models (factor variances, factor loadings, factor covariance and unique variances) may be constrained, or assigned an arbitrary value. Measurement invariance was introduced by Byrne, Shavelson, and Muthén (1989) using sensitivity analyses for stability in baseline models, 'determining partially invariant measurement parameters, and . . . testing for the invariance of factor covariance and mean structures, given partial measurement invariance' (Byrne et al., 1989, p. 456). Measurement invariance, or measurement equivalence, thus establishes that each iteration measures the same construct (latent variable).
Reliability concerns whether the effect being investigated persists from one sample to another. Raykov (2004) and Raykov, Dimitrov, and Asparouhov (2010) used latent variable modelling for measurement invariance and reliability. Raykov (2004, 2012) argued that coefficient alpha does not estimate scale reliability at population levels, and proposed another reliability coefficient model based on scale reliability rather than the restrictions of Cronbach's α (Cronbach, 1951). Cronbach's α requires that the factor loadings of all items are equal. More recently, Raykov, Gabler, and Dimitrov (2016, p.1) established a latent variable modelling procedure 'for point and interval estimation of the difference between the maximal and scale criterion validity coefficients'. This overcomes issues regarding the use of unidimensional multicomponent measures.
Criterion-related validity is one aspect of validating an instrument, referring to an item on a questionnaire actually measuring the intended outcome (Lodico, Spaulding, & Voegtle, 2010). The others include face validity (relevance of items to intent) and content validity (items relevant to the content being measured). Criterion-related validity reflects the relationship between two scores on two different measures, and tests whether the outcome from the measure, its performance, can be predicted (Lodico et al., 2010). Raykov's (2007) latent variable modelling approach is used in this research for reliability and criterion validity.
Item response theory, a paradigm for the measurement of items in relation to the latent variable, is used extensively in education tests, including test construction, estimating ability and score reporting (Deng et al., 2008). Item response models take into consideration the degree of difficulty of each item in scaling items. Item response theory has, as noted, an assumption of unidimensionality (Deng et al., 2008).
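As an illustration of how item difficulty enters an item response model, the following minimal Python sketch evaluates a two-parameter logistic item response function. The two-parameter form, the function name irf_2pl and the parameter values are assumptions chosen for illustration only; they are not taken from the analysis reported in this paper.

```python
import math


def irf_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model.

    theta: examinee ability (latent variable)
    a:     item discrimination (hypothetical value)
    b:     item difficulty (hypothetical value)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items: same discrimination, different difficulty.
easy_item = irf_2pl(theta=0.0, a=1.0, b=-1.0)
hard_item = irf_2pl(theta=0.0, a=1.0, b=1.0)
print(easy_item, hard_item)
```

For an examinee of a given ability, the more difficult item yields the lower probability of a correct response, which is how difficulty is taken into account when items are scaled.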
Differential item functioning refers to the potential for bias in the test items which could skew data (be unfair) to sub-groups based on gender, race or age (Strobl, Kopf, & Zeileis, 2010).The bias may exist in a single item, or goodness-of-fit tests may show a trend, or a likelihood of bias among the variables.
There is a wide variety of statistical techniques for evaluating difference in both dichotomous and polytomous items (Gómez-Benito, Hidalgo, & Zumbo, 2013; Hambleton & Swaminathan, 2013; Sireci & Rios, 2013). Among these, that of Mantel and Haenszel (1959) remains a reference technique (Guilera, Gómez-Benito, Hidalgo, & Sánchez-Meca, 2013). Strobl et al. (2010) explained that testing for difference can be based on the specific sub-group supporting interpretation but leaving open the possibility of unexplained bias. At an extreme, all item parameter differences can be tested for bias among all possible sub-groups, leading to interpretation difficulty. Strobl et al. proposed a semi-parametric model using recursive partitioning to address this.

Methodology
The data were the scores of 6 371 examinees on 66 multiple-choice items on the Saudi Computer Science Teacher Test. The test had four response options per item, only one of which was correct, so each item was scored 1 for a correct response and 0 otherwise. The test items were classified into three content-specific domains.
Confirmatory factor analysis was used to test the validity of hypothesised models of the test and its three content-specific domains. The first question concerned the dimensionality of the data. Three different confirmatory factor analysis models were tested and compared on data fit with the teacher test scores: model A, a one-factor model; model B, a three-factor model with the three content-specific domains as correlated latent factors; and model C, a three-factor model with the three content-specific domains as uncorrelated latent factors. The models were tested for data fit using the program Mplus (Muthén, 2016). In the Mplus syntax for the three models, the factor indicators (test items) were declared as categorical variables because the item scores are dichotomous (0/1). Thus the factor analysis was based on the tetrachoric correlations (i.e., correlations for dichotomous observed values) for the scores of the test items. This avoided the issues that arise when Pearson correlations are used for factor analysis of categorical variables. The analysis of test data for item response theory used the program Xcalibre 4 (Assessment Systems, 2016).
The score reliability was estimated through the use of a latent variable modelling (LVM) approach taking into account the binary nature of the item scores (Dimitrov, 2012; Raykov, 2007; Raykov et al., 2010). The congeneric model for latent normal variables $Y_1^*, Y_2^*, \ldots, Y_p^*$, assumed to underlie a set of binary items $Y_1, Y_2, \ldots, Y_p$, is, according to Jöreskog (1971),
$$Y_i^* = \lambda_i \eta + \varepsilon_i, \qquad (1)$$
where $\eta$ is a common latent factor with a variance set equal to 1, $\lambda_i$ are factor loadings, $\varepsilon_i$ are latent disturbances, and the probability of correct response on $Y_i$ is given by the area under the standard normal curve to the right of a pertinent threshold $\kappa_i$ ($i = 1, 2, \ldots, p$). Under this model, the score reliability, $\rho$, is estimated through the following equation (e.g., Bollen, 1989):
$$\rho = \frac{\left(\sum_{i=1}^{p} \lambda_i\right)^2}{\left(\sum_{i=1}^{p} \lambda_i\right)^2 + \sum_{i=1}^{p} \mathrm{Var}(\varepsilon_i)}, \qquad (2)$$
where the numerator represents the true-score variance and the denominator represents the total variance (i.e., the sum of true variance and error variance).
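The reliability coefficient described above, the ratio of true-score variance to total variance under a one-factor model with unit factor variance, can be sketched in a few lines of Python. The function name composite_reliability and the four loadings are hypothetical and serve only to show the arithmetic; setting each error variance to one minus the squared loading assumes standardised latent responses, which is an assumption of this sketch rather than a statement about the Mplus estimation used in the study.

```python
def composite_reliability(loadings, error_variances):
    """Ratio of true-score variance to total variance for a one-factor model.

    Assumes the common factor has variance 1, so the true-score variance
    of the sum score is the squared sum of the loadings.
    """
    if len(loadings) != len(error_variances):
        raise ValueError("one error variance is required per loading")
    true_variance = sum(loadings) ** 2
    error_variance = sum(error_variances)
    return true_variance / (true_variance + error_variance)


# Hypothetical standardised loadings for four items (not from the paper).
loadings = [0.6, 0.5, 0.7, 0.4]
# Assumption: standardised latent responses, so Var(eps_i) = 1 - lambda_i^2.
error_variances = [1 - lam ** 2 for lam in loadings]
rho = composite_reliability(loadings, error_variances)
print(round(rho, 3))
```

Because the loadings enter individually, the coefficient does not require them to be equal, which is the restriction that distinguishes it from Cronbach's α.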
Cronbach's α for internal consistency (reliability) was also used; however, α underestimated the reliability obtained from the latent variable modelling approach, consistent with the literature review discussion. Further, under the congeneric measurement model in equation 1, the assumption of tau-equivalency is met when the factor loadings are equal, λ 1 = λ 2 = … = λ p (e.g., Jöreskog, 1971).
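For comparison, Cronbach's α can be computed directly from a matrix of item scores using the standard formula based on item variances and the variance of the total score. The sketch below uses only the Python standard library; the function name cronbach_alpha and the small 0/1 score matrix are hypothetical and are not data from the Computer Science Teacher Test.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of examinee rows of item scores.

    scores: list of lists, one row per examinee, one column per item.
    Uses sample variances (n - 1 denominator) throughout.
    """
    n_items = len(scores[0])
    item_variances = [
        variance([row[i] for row in scores]) for i in range(n_items)
    ]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical dichotomous scores: five examinees, four items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

When the items are not tau-equivalent, this value would be expected to fall below a reliability estimate obtained from the factor loadings.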
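In differential item functioning analyses, groups are compared on item performance after adjusting for overall performance on the measured trait (Hambleton & Swaminathan, 2013). Under the null hypothesis of no DIF, the Mantel-Haenszel statistic follows a chi-square distribution with one degree of freedom. Under this procedure, an effect size estimate based on the common odds ratio $\alpha$ is expressed as
$$\alpha = \frac{\sum_j A_j D_j / N_j}{\sum_j B_j C_j / N_j}, \qquad (3)$$
where, in the standard Mantel-Haenszel notation, $A_j$ and $B_j$ are the numbers of reference-group examinees at matched score level $j$ who answered the item correctly and incorrectly, $C_j$ and $D_j$ are the corresponding counts for the focal group, and $N_j$ is the total number of examinees at that level. Holland and Thayer (1988) proposed a logarithmic transformation of $\alpha$ for interpretive purposes, with the aim of obtaining a symmetrical scale in which a zero value indicates an absence of DIF, a negative value indicates that the item favours the reference group over the focal group, and a positive value indicates DIF in the opposite direction. This transformation, the delta metric, is expressed as
$$\Delta = -2.35 \ln(\alpha). \qquad (4)$$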
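The Mantel-Haenszel effect size and its delta transformation can be illustrated with a short Python sketch. It uses the standard form of the estimator, in which each matched score level contributes a 2 x 2 table of reference and focal group counts, and the delta metric is -2.35 times the natural logarithm of the common odds ratio. The function name mantel_haenszel and the two strata of counts are hypothetical and are not taken from the study data.

```python
import math


def mantel_haenszel(strata):
    """Common odds ratio and delta metric across matched score levels.

    strata: iterable of (a, b, c, d) counts per score level, where
      a = reference group correct, b = reference group incorrect,
      c = focal group correct,     d = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if numerator <= 0 or denominator <= 0:
        raise ValueError("odds ratio is undefined for these counts")
    odds_ratio = numerator / denominator
    delta = -2.35 * math.log(odds_ratio)
    return odds_ratio, delta


# Hypothetical counts at two score levels (not from the paper).
strata = [(30, 20, 20, 30), (40, 10, 30, 20)]
odds_ratio, delta = mantel_haenszel(strata)
print(round(odds_ratio, 3), round(delta, 3))
```

In this made-up example the reference group has higher odds of success at both score levels, so the odds ratio exceeds 1 and the delta value is negative, the direction described above as favouring the reference group.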

Results
The test results for data fit of the three models (A, B, and C) are summarised in Table 1. The assessment of model fit is based on the evaluation of the following goodness-of-fit indices, with cutting scores for an excellent fit as follows:
• Comparative fit index: CFI > 0.95; Incremental Fit Index: IFI > 0.95;
• Standardised root mean square residual: SRMR = 0.00 (SRMR < 1.00 for an adequate fit);
• Root mean square error of approximation: RMSEA = 0.00 (RMSEA ≤ 0.05 for an adequate data fit) (Hu & Bentler, 1999; Marsh, Wen, & Hau, 2004).
The results in Table 1 indicate that the one-factor model (model A) provides an adequate data fit. A very slight improvement in data fit is obtained with model B, where the correlations between the three domains of the test are taken into account. These correlations were found to be very high, ranging from 0.883 to 0.949 (see Table 2). The standardised item factor loadings and thresholds of the 66 items of the test under the one-factor CFA model (model A) are provided in the appendix. The analysis of the sample showed 60.5% were females and 39.5% males, which differed from the overall population. All factor loadings were statistically significant (p < .001), with the exception of the loading for item 45 (p = .428) and item 65 (p = .340).
The reliability of the data was estimated by a latent variable modelling (LVM) approach (equations 1 and 2). The reliability estimate was found to be 0.848, with a 95% confidence interval of (0.842; 0.854). Cronbach's α was 0.749 and thus underestimated the LVM reliability (α < 0.848), as discussed above.
The data were tested for DIF across gender using the two Mantel-Haenszel statistics, α and Δ (equations 3 and 4), the results of which are provided in the appendix. For interpretation of these results, α is reported with a z-statistic and its p-value, where DIF is signalled by a statistically significant z-value (p < .05), and the absolute value of Δ is used to classify the magnitude of DIF (Holland & Thayer, 1988).
Based on these criteria, the results in the appendix indicated that DIF is signalled for 13 items, of which 9 items fall in category B for moderate DIF (6 against females and 3 against males) and 4 items in category C for large DIF (2 against females and 2 against males). The remaining 53 items are either not signalled for DIF or were categorised as A, negligible DIF, acceptable for the purposes of this study (see Zwick & Ercikan, 1989).
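The classification rule applied to each item can be written as a small Python function. It follows the criteria stated in this paper: DIF is signalled only when the z-statistic is significant (p < .05), and the absolute value of Δ then places the item in category A (below 1.0), B (1.0 to 1.5) or C (above 1.5). The function name classify_dif, the returned labels and the example values are hypothetical.

```python
def classify_dif(delta, p_value, significance_level=0.05):
    """Classify one item's DIF from its delta value and the p-value of z.

    Returns 'not signalled' when the test is not significant; otherwise
    'A' (negligible), 'B' (moderate) or 'C' (large) from |delta|.
    """
    if p_value >= significance_level:
        return "not signalled"
    magnitude = abs(delta)
    if magnitude < 1.0:
        return "A"
    if magnitude <= 1.5:
        return "B"
    return "C"


# Hypothetical (delta, p-value) pairs for four items (not from the paper).
examples = [(0.4, 0.01), (-1.2, 0.003), (1.8, 0.001), (2.0, 0.20)]
labels = [classify_dif(delta, p) for delta, p in examples]
print(labels)
```

Counting the items labelled B and C by the sign of the statistic would reproduce the kind of tally by gender reported above.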

Conclusion
This study examined the factor structure of the Computer Science Teacher Test and its psychometric characteristics to validate interpretations and decisions about certification of teachers in Saudi Arabia. The results showed that the test items are essentially unidimensional, supporting the use of item response modelling.
The results in this study support the validity of interpretations and decisions related to certification of teachers in Saudi Arabia based on their computer test scores. This outcome should guide test developers and researchers at the National Assessment Centre in the further evolution of the Computer Science Teacher Test.
In the DIF results reported in the appendix, significance is taken at p < .05, with DIF against males if z > 0 and DIF against females if z < 0. The absolute values of the statistic Δ are used to classify DIF into three categories: category A, negligible DIF, when |Δ| < 1.0; category B, moderate DIF, when 1.0 ≤ |Δ| ≤ 1.5; and category C, large DIF, when |Δ| > 1.5.

Table 1. Data fit of three CFA models from Teacher Test Data

Table 2. Correlations among Teacher Test Domains
Data fit results in Table 1 showed high correlations among the domains in models A and particularly B, therefore the teacher test data are essentially unidimensional. Model C, where the three test domains are assumed uncorrelated, does not converge with the test data.