Adapting the Critical Thinking Assessment Test for Palestinian Universities

Critical thinking is a key learning outcome for Palestinian students. However, there are no validated critical thinking tests in Arabic. We assessed the suitability of the US-developed Critical Thinking Assessment Test (CAT) for use in Palestine. The test was piloted with university students in English (n=30), and 4 questions were piloted in Arabic (n=48). Students responded favorably, scores were comparable with US scores, and only two students found the content problematic. One hundred twelve Palestinian faculty reviewed the skills tested by the CAT; there was moderate agreement that the skills represent critical thinking. Translation of the CAT into Arabic and further study are warranted.


Introduction
The globalization of higher education over the last few decades has been accompanied by an enormous range of different kinds of assessment; witness, for example, the rash of new international university rankings that have appeared since the onset of the present millennium, focused on assessing universities' relative reputations (Rauhvargers, 2011). At the regional, national and institutional levels there have also been widespread efforts to assess the quality of teaching as part of broader Quality Assurance (QA) initiatives (Bernhard, 2012). Both of these kinds of efforts have expanded quickly to include national systems of higher education of all types across the world. In Palestine, where the modern university sector only began in the mid-1970s and has been beset by unique obstacles due in large part to an ongoing era of occupation and conflict (Abu-Lughod, 2000; Abu-Saad & Champagne, 2006), a national Accreditation and Quality Assurance Commission (AQAC) was nevertheless established in 2002.
Like other national QA schemes, complex issues lie at the core of the Palestinian QA initiative: questions about what should be measured, how institutions should assess their educational impact, and which kinds of outcomes provide the most reliable information about institutional strength, teaching effectiveness, and quality of student learning. Moreover, a critical tension between assessment for accountability and assessment for improving learning is frequently overlooked in broader policy debates. Improvements to teaching and learning based on assessments of higher-order student learning outcomes, and evaluations of programs which use such assessments, are even less frequent. Indeed, despite the growth of alternative learning-centered methods of assessing student learning (Light, Cox, & Calkins, 2009), traditional modes of instruction and assessment continue to be the main methods teachers use to assess their students' learning. In a national study of undergraduate teaching practices in Palestine, Cristillo (2009) found that teaching and assessment practices were primarily geared toward lower levels of learning such as rote memorization.
Nevertheless, interest in critical thinking has been growing, both at the national level, where policy documents identify critical thinking activities as key program goals (Basha, 2012), and at the institutional level, where recently developed learning and teaching centers of excellence have identified critical thinking as a major goal of their faculty development initiatives (Daragmeh, Drane, & Light, 2012). Indeed, as a result of a recent, large-scale national project focused on developing learning and teaching in Palestinian universities (the Palestinian Faculty Development Project), four such university centers have been established and three national conferences have been organized emphasizing the centrality of learning and the importance of critical thinking in higher education. In addition, two national capacity-building workshops for both faculty and Palestinian trainers of faculty have been held, in Jericho in 2014 and Ramallah in 2015, with a special emphasis on critical thinking.
Despite this broad upsurge in interest and activity in critical thinking across disciplines and fields, there has been little to no development of a test of critical thinking that might be similarly used across disciplines. Currently there are no robust (valid and reliable) tests of critical thinking in Arabic which may be used to assess improvement in student critical thinking over time and which are appropriate to the Palestinian context. Assessment of students' critical thinking using valid and reliable methods is vital to national and institutional efforts to improve critical thinking, to a) ensure that progress in critical thinking is actually being made, and b) help identify teaching approaches that lead to the greatest gains in critical thinking. It cannot be assumed that critical thinking skills will improve in students simply because critical thinking has been targeted as a learning outcome. For example, Rwanda has identified critical thinking as key to its national strategy to develop a skilled workforce (MINEDUC, 2010). However, in a recent study of students at three prestigious universities in Rwanda, Schendel (2015) found that students were not making meaningful gains on a test of critical thinking designed specifically for the Rwandan context.
The purpose of this study was to assess whether one such instrument, the Critical Thinking Assessment Test (Stein & Haynes, 2011), used extensively in the United States (US), may be appropriate for the Palestinian higher education context. As Schendel (2013) notes, the validity of an assessment in one context does not automatically indicate that the assessment will be valid in another. The aim of the study was to gather data on a) the response of Palestinian students to the test, and b) the response of Palestinian faculty to the critical thinking skills examined on the test.

The Critical Thinking Assessment Test (CAT)
The Critical Thinking Assessment Test (CAT) is a 15-item, short-answer essay test developed in the United States with the support of the National Science Foundation (NSF) to assess critical thinking skills in undergraduate students in Science, Technology, Engineering and Math (STEM) and related fields. It tests critical thinking across four core domains: a) evaluation of information, b) evaluation of ideas and other points of view, c) learning and problem solving, and d) communication of ideas. It does not test rote memory of information, but rather requires students to exercise higher-order thinking skills such as those on the upper levels of Bloom's Taxonomy of Educational Objectives (1956): application, analysis, synthesis, and evaluation. Specific critical thinking skills assessed by the CAT are listed in Table 1 below. Test questions are based on real-world scenarios. Most require short essay answers which reveal the students' thought processes. The short essay format was chosen because it has been shown to be less racially biased, to have higher construct validity, and to test more skills in a single question than multiple-choice questions (US Department of Education, 2000). While the CAT is not a timed test, most students take approximately one hour to complete it. Below is a sample disclosed item from the CAT. "A scientist working at a government agency believes that an ingredient commonly used in bread causes criminal behavior. To support his theory the scientist notes the following evidence:
99% of the people who committed crimes consumed bread prior to committing crimes.
Crime rates are extremely low in areas where bread is not consumed.
Do the data presented by the scientist strongly support his theory? Yes / No
Are there any other explanations for the data besides the scientist's theory? If so, describe.
What kind of additional information or evidence would support the scientist's theory?"
The CAT may be scored by either faculty or graduate students using the detailed scoring rubrics provided with the test. Importantly, it has been shown to be sensitive to course effects (Stein, Haynes, & Redding, 2006). National US norms for performance on the test are available. It has been administered in several hundred colleges and universities across the US and found to be valid, reliable, and appropriate for students across all institutional types and levels. In terms of validity, the CAT has been shown to have satisfactory face validity and criterion validity (Stein, Haynes, Redding, Ennis, & Cecil, 2007). Face validity of the CAT was established in a study in which the 12 skills were shown to faculty from a variety of disciplines at the 6 US universities where the CAT was first developed. Agreement was 80 percent or higher across the 12 skills, indicating a high degree of face validity (Stein et al., 2007). CAT scores are moderately correlated with general measures of academic performance such as the SAT (r=0.527), ACT (r=0.599), and Grade Point Average (GPA) (r=0.345). In addition, scores on the CAT are moderately correlated with scores on other measures of critical thinking such as the California Critical Thinking Skills Test (CCTST; r=0.645). Test-retest reliability is acceptable at > 0.80 (Stein et al., 2007). Internal consistency of CAT items is also acceptable (Cronbach's alpha=0.695), suggesting that the items measure the same general construct (Stein et al., 2007). Finally, the cultural fairness of the CAT has been evaluated in the US: neither gender, race nor ethnic background is statistically significantly associated with performance (Stein et al., 2007). A cultural Differential Item Functioning (DIF) analysis has also been performed and indicated no items with prevalent cultural bias (Stein & Haynes, 2011).

Table 1. Critical thinking skills assessed by the CAT

Q1. Summarize the pattern of results in a graph without making inappropriate inferences.
Q2. Evaluate how strongly correlation-type data supports a hypothesis.
Q3. Provide alternative explanations for a pattern of results that has many possible causes.
Q4. Identify additional information needed to evaluate a hypothesis or a particular explanation of an observation.
Q5. Evaluate whether spurious relationships strongly support a claim.
Q6. Provide alternative explanations for spurious relationships.
Q7. Identify additional information needed to evaluate a hypothesis/interpretation.
Q8. Determine whether an invited inference in an advertisement is supported by information.
Q9. Provide relevant alternative interpretations of information.
Q10. Separate relevant from irrelevant information when solving a real-world problem.
Q11. Analyze and integrate information from separate sources to solve a real-world problem.
Q12. Use basic mathematical skills to help solve a real-world problem.
Q13. Identify suitable solutions for a real-world problem using relevant information.
Q14. Identify and explain the best solution for a real-world problem using relevant information.
Q15. Explain how changes in a real-world problem situation might affect the solution.
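The internal-consistency figure quoted above (Cronbach's alpha = 0.695) is straightforward to compute from an item-by-respondent score matrix. The sketch below is a minimal pure-Python illustration of the standard formula on made-up data; the demo scores are hypothetical, not CAT data.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents' item-score lists.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    Population variances are used, as is conventional for this formula.
    """
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose: one tuple per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical scores for 5 respondents on 4 items (illustrative only)
demo = [
    [2, 3, 2, 3],
    [1, 1, 2, 1],
    [3, 3, 3, 2],
    [0, 1, 1, 1],
    [2, 2, 3, 3],
]
print(round(cronbach_alpha(demo), 3))
```

Values near 0.7, like the CAT's 0.695, are conventionally read as acceptable evidence that the items tap one general construct.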
This research consists of a series of 3 studies. Studies 1 and 2 focused on students' responses to the CAT and involved students taking the full test in English and a subset of questions in Arabic respectively. The third study was a survey study focused on faculty responses to the skills assessed on the CAT. Methods, procedures and results of each study are reported below.

Study 1: Students' Responses to the CAT Test in English
The aims of this first study were to examine English-speaking Palestinian students' responses to the English version of the CAT in order to a) determine if the students could relate to the contexts used in the questions, b) assess their comfort with the test, and c) determine if any aspects of the test were confusing for them.

Participants
The study sample was a convenience sample of 30 students from the faculties of nursing and medicine (n=28) and information technology (n=2) at 2 large, independent, non-governmental universities in Palestine. They were invited to participate in the study by 3 of their course professors, who were known to the first author (though not at his home institution in Palestine). They were selected to participate in the study because of their excellent English language skills; all had completed their English proficiency course requirements with a grade of at least a B (i.e., a score of 80). All were told that the test was part of a research study and that their test scores would be used only for the research study and would not contribute to their course grade. Characteristics of the students are presented in Table 2. Males and females were equally represented in the sample. They ranged in age from 19 to 21 years and were freshmen, juniors and seniors. Half of the participants had spent time outside of Palestine, generally for periods of less than 6 months in either European or Arab countries; only one had spent time in the US. All had learned their English primarily in Palestine.

Testing Procedures
Students completed the CAT test at their institution in a quiet room and were supervised by the first author. Students were not given any course credit or monetary incentive for completing the test. The CAT is not a timed test, so students were told that they could take as much time as they needed. The completed tests were scored by a team of experienced scorers who had been trained in how to score the CAT test by the designers of the test.
Immediately after they finished the test, all 30 students completed a survey in English asking them whether any aspects of the test directions or content were confusing. They were also asked to rate the difficulty of the test on a scale from 1 (very difficult) to 7 (very easy), and their interest in the test, again on a 7-point scale, from 1 (very interesting) to 7 (not very interesting at all). Students were invited to include explanations for their answers. The first author also noted comments that students made to him directly after completing the test.

Results
All 30 students completed the test. Time taken to complete the test ranged from 1 to 2 hours. Scores on the test ranged from 5 to 27 (out of a possible 38), with a mean of 16.4 and standard deviation of 5.7. The mean score fell above the US norm for community colleges (13.5) and between the US norms for freshmen at 4-year institutions (13.7) and seniors at 4-year institutions (19.0). Overall, the students responded favorably to the test, reporting that they found it interesting and motivating. Even though the test questions were developed to fit the cultural context in the US, only 2 students reported on the survey that the content was problematic. One explained: "As the questions are related to cases in foreign country, it is difficult to think for possible answers." The second, referring to the water purification scenario, wrote "Because it was my very first time reading about purification". The same student suggested that it would be easier if the questions were related to cases in their own country. One third (n=10) of the students reported that the test directions were confusing. However, when asked to explain what was confusing, most did not describe a problem understanding the directions, but rather referred to challenges with the type of thinking that was required. For example, one student wrote "When I know that it is a critical thinking assessment, I started criticizing everything and said no to almost every question. May be I should not have been told to give more accurate information". Another student wrote "It is complicated. Too many answers needed to be written with explanations". A third student wrote "It depends on my analytical competencies". Only one student made reference to a specific aspect of the test instructions that was confusing: "The moment you were told the type of study which can include a third variable, the last question was quite confusing." Seven of the 30 students found some of the test questions confusing, mainly because some of the English vocabulary was new to them.
Students varied in their opinions about the difficulty of the test: some found it very hard, some very easy, with the majority finding it moderately difficult. Students' ratings of the difficulty of the test ranged from 2 to 7 (where 1 is very difficult and 7 is very easy). Students also varied in how well they thought they did on the test; most felt that they had done moderately well to well. This is consistent with the responses of college students who have taken the test in the US (Stein, 2012). There was a high level of interest in the test, which is also consistent with findings in the US (Stein et al., 2009), with a number of students reporting either that it was a new experience for them or a new way of thinking, and several commenting that they enjoyed the test and the challenge of taking it. Interest ratings ranged from 1 to 7 (where 1 is very interesting and 7 is not very interesting at all), with two thirds of the sample choosing a rating of either 1 or 2. Comments collected from the students after they finished the test shed some light on their capacity to deal with the CAT test. The majority of students believed that they had made substantial gains in their ability to think critically after completing the test. Some students asked to receive instruction in critical thinking, commenting that they had never been exposed to such challenging questions. Some said that because the test was new for them they were concerned that they might do the wrong thing, and that this was stressful. At the end of the test a number of students said that they were tired because of the mental effort that the test required.
The mean total score for males was 16.4 (sd=6.0) and the mean total for females was 16.5 (sd=5.5). An independent t-test revealed that this difference was not statistically significant, suggesting no gender bias in the test.
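The gender comparison above can be reproduced from the reported summary statistics alone. The sketch below (pure Python, standard library only) computes an independent-samples t statistic with pooled variance from the group means and standard deviations, assuming equal group sizes of 15, which the stated 50/50 gender split of the 30 participants implies; the exact test variant the authors used is not reported.

```python
import math

def independent_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's independent-samples t statistic (pooled variance)."""
    # Pooled variance across the two groups
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Reported Study 1 summary statistics: males 16.4 (sd 6.0), females 16.5 (sd 5.5),
# assuming n = 15 per group from the stated equal gender split
t = independent_t(16.4, 6.0, 15, 16.5, 5.5, 15)
print(round(t, 3))
```

The resulting |t| is tiny, far below the roughly 2.05 critical value at df = 28, consistent with the authors' conclusion of no statistically significant gender difference.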

Discussion
Contemporary Palestinian higher education is very different from the modern university environment which prevails in western countries, especially from the environment which exists in the US. In addition to its Middle Eastern location and distinctive cultural character, today's Palestinian university perseveres and flourishes in a unique socio-political context. The sector has been marked from its beginnings by occupation and conflict (Abu-Lughod, 2000; Abu-Saad & Champagne, 2006), and university students in Palestine live and study today in a social context fundamentally unrecognizable to present-day American students. Nevertheless, despite these considerable differences, the results of the study above suggest that the CAT test developed in the US is a valid and meaningful instrument for assessing the critical thinking skills of Palestinian university students, at least of those students whose English language ability is sufficient to complete the English version of the test. Quantitative results were similar to those obtained with students in American colleges and universities: Palestinian students scored within the range in which US students fall, and there was no statistically significant difference between the total scores of males and females. These results suggest that the test may not be culturally or gender biased. Students' qualitative reports of their experience of the test, moreover, echoed the experience of American students. Although one third of the sample who took the test in English reported that the directions for the test were confusing, their confusion seemed to relate chiefly to the challenging nature of what they were being asked to do, rather than to confusing language. Their comments echoed those of many American students who noted that their traditional American education did not stress the kinds of thinking skills which the CAT assesses.
The essential skills which the CAT assesses were no more alien to or difficult for Palestinian students than they were for American students. Indeed, the results suggest the test gauges higher-order thinking skills which are common to both sets of students, skills which are, moreover, sought after by educational policy commentators in both contexts (Arum & Roksa, 2011; Fasheh, 2014; AbuLaban, 2014).
It is worth noting that the Palestinian students in the above study were all fluent in English and many had spent time abroad. These students did not necessarily reflect the typical Palestinian university student. Their skills with a second language, particularly English, and their experiences abroad might explain the similarity of their experience to those of American students. Additional study of Palestinian students taking the test in Arabic is warranted, and we undertake this in Study 2 below.

Study 2: Students' Responses to the First 4 CAT Questions Translated into Arabic
The aim of this study was to assess Palestinian students' responses to the first 4 questions of the CAT test which had been translated into Arabic.

Participants
Forty-eight students from the same Palestinian universities as in Study 1 participated in Study 2. Fifty percent were female and 50% were male; 24 were freshmen and 24 were seniors. Students from the faculties of medicine, education and business were invited to participate in the study by their course professors. The first author contacted a number of faculty at each institution and asked if they would nominate students who might be interested in answering the 4 questions. A few instructors asked if they should nominate their best students; they were told simply to invite those students who they thought would be interested in doing the test. Students were told that the 4 questions were part of a longer test of critical thinking. They did not receive any incentive, course credit or other reward for answering the questions.

Testing Procedures
The first 4 questions of the CAT test were translated into Arabic by the first author. The first 4 questions focus on critical thinking skills 1 to 4 listed in Table 1. Translation of the 4 questions was validated by 4 Palestinian university faculty before they were given to students. The first author administered the test questions, and was present while students answered them. Once again, students completed the questions in a room under quiet conditions. They were told that their responses would be anonymous and that no identifying information would be collected. They were given about 45 minutes to answer the 4 questions which was considered sufficient time as the majority of students in the US complete the entire test in just under 1 hour. Participants did not complete a formal survey after answering the 4 questions. However, the first author was present during and after testing and collected comments from participants after they had completed the test.

Results
All 48 students completed the Arabic version of the first 4 questions. Most of the students reported that they had an easy time answering the questions. A few complained that they had never been given a similar test in the past, and some seemed nervous while answering the questions. A couple of students asked for more time to answer the questions, while a few others stopped suddenly during the test and asked if they were allowed to leave the hall and come back after a quick break; they were not given a break, however. Many of the students offered favorable comments about the opportunity to take the test and said that they wished that their own teachers would focus on this kind of thinking.
Mean scores on the 4 items for the 48 students are presented in Table 3, along with those of the students who took the English version of the CAT and the norms for US college freshmen and seniors. Mean scores for the Palestinian students were above those of US students for 2 items and within the range of freshmen and seniors for one item. Only scores for the first item fell below those of US students. It is interesting to note the higher scores of the Palestinian students on items 3 and 4. This may have been because a number of the students were medical students who were familiar with the kind of critical thinking often applied in observational studies in public health.

Discussion
To ascertain whether the results of the first study would hold when the test was given in Arabic rather than English, the second study investigated the experience of Palestinian students taking translated CAT questions in Arabic (Hambleton, 2005). While this follow-up study focused only on the critical thinking skills measured by the first four questions of the CAT, its results support the premise that the critical thinking skills assessed by the CAT are appropriate to Palestinian higher education students more broadly, and that an Arabic adaptation of the CAT might be successfully developed and employed to assess the critical thinking skills of Palestinian students in their main language of instruction. While the results suggest the Arabic items are meaningful to the students, differences on some items, such as item 1, indicate that more refinement may be required for particular items, and certainly in a full adaptation of the entire test (Sireci, 2005). The results do indicate that a full Arabic adaptation of the CAT would be valuable.
In addition, it is worth noting that the comments (and signs of anxiety) from students about their lack of preparation for the kind of thinking the test measures, and their interest in receiving instruction in this kind of thinking, highlight the importance of the role of the instructor in the use and application of such a test. Engagement of faculty with the key critical thinking skills is necessary if these skills are to develop further in Palestinian universities.

Study 3: Faculty Perceptions of Critical Thinking Skills Assessed on the CAT
The aim of the third study was to assess the face validity of the CAT for Palestinian faculty. For a critical thinking assessment to be useful in any context, it is important that faculty agree and have confidence that it is a measure of critical thinking. This is particularly important in the case of the CAT because it was designed to be used not only for assessment, but also for faculty development. Although there is a high level of agreement among STEM faculty in the US that the skills tested on the CAT constitute critical thinking (Stein, Haynes, & Redding, 2006), it is important to establish whether or not this is the case for Palestinian faculty.

Participants
All Palestinian universities and university colleges were invited to participate in the study. The first author phoned the Vice President for Academic Affairs at each institution to gain approval for faculty at that institution to participate, and followed up the telephone call with an official letter of invitation. In addition, the first author reached out to individual faculty at the universities and colleges. The study was described as looking at one aspect of educational skills at Palestinian universities from the perspective of both students and faculty.
One hundred twelve faculty responded to the survey. Characteristics of respondents are summarized in Table 4. The majority of participants were full-time male faculty, which is representative of faculty at Palestinian universities. Forty-two percent held PhDs, which is also representative of Palestinian university faculty. A variety of disciplines were represented, including business, the humanities, medicine, and science, math and engineering. The majority of the sample had at least 2 years of teaching experience, with approximately one third having more than 10 years of teaching experience. It should be noted that there was quite a bit of missing data for both the demographic questions and the questions about critical thinking: missing data rates ranged from 14% to 15% for demographic questions and from 17% to 35% for the critical thinking skills. We are unsure of the reason for this relatively high rate of missing data. It may have been due to faculty being unaccustomed to this type of survey, or to concern about revealing data which they may have felt was identifiable. In the case of the questions about critical thinking skills, participants may have fatigued during the survey and/or may not have fully understood the critical thinking skills. Also, as there were no incentives for completing the survey, faculty may have lacked motivation.
www.ccsenet.org/jel Journal of Education and Learning, Vol. 5, No. 2

Participants completed an on-line survey. Questions were presented in both Arabic and English, and participants were free to respond in whichever language they were most comfortable with. The first section of the survey asked participants for demographic information. In the second section, participants were presented with the 15 skills tested on the CAT. For each skill, they were asked to indicate whether or not they agreed that the skill was a dimension of critical thinking. They were also asked whether they felt the skill was important and relevant to their teaching, and a number of other questions about their understanding and teaching of critical thinking. We report here only the demographic data and the data on agreement/disagreement with the critical thinking skills; data on the relevance of the skills to their teaching and other survey items will be reported in an upcoming paper.

Results
Overall there was a moderate level of agreement that the skills tested on the CAT represent critical thinking. Levels of agreement ranged from 42.0% for "Evaluate whether spurious relationships strongly support a claim" to 73.2% for "Evaluate how strongly information supports a hypothesis or interpretation". These rates are lower than those reported for faculty in the US, where agreement was at least 80% across all the skills (Stein, Haynes, Redding, Ennis, & Cecil, 2007). However, this may be because the US sample included only STEM faculty, whereas the Palestinian sample included both STEM and non-STEM faculty. When agreement was examined among STEM faculty only, the agreement rate was much higher, at 80% or above for all but three skills (Table 6). The percent agreeing may also be lower in the Palestinian sample because data were missing for between 17% and 35% of the sample on these items, a higher rate than for the demographic data. This may have been due to difficulty comprehending and interpreting the CAT skills, or to time constraints and survey fatigue which prevented busy faculty from completing the survey.

Discussion
Faculty recognition of the validity of the critical thinking assessment test is essential for both its acceptance and its implementation in Palestinian institutions of higher education. Even more importantly, perhaps, recognition of such skills is vital if instructional changes are to be made in the teaching and learning environment (Light, Cox, & Calkins, 2009) to achieve the critical thinking skills being advocated for nationally. The face validity of the test for faculty is a critical condition of both its use and its potential for instructional change. In this respect, the third study revealed moderate to high agreement among Palestinian faculty that the skills tested on the CAT are key dimensions of critical thinking, with even greater agreement among STEM faculty. This suggests that the test has a reasonable degree of face validity. The results are especially notable among the STEM faculty. This is of particular significance given the weight which Palestinian educational strategy places on the development of critical thinking skills for employability, and its focus on science and technology in its work and employment strategies (Palestinian National Authority, 2012).

Conclusions
The results of the 3 studies suggest that the CAT test has potential as an assessment tool for critical thinking in Palestinian higher education. It could be particularly useful as part of a national quality assessment strategy for improving learning, the kind of strategy often missing from broader policy debates, which has been recommended in a recent national report on undergraduate teaching practices in Palestinian higher education (Cristillo, 2009) and by the recently established Association of Palestinian Academic Developers. Together these results suggest that a large-scale validation study of the CAT test in Arabic would be worthwhile. We recommend that the CAT be carefully translated into Arabic, in accordance with international guidelines (Hambleton, 2005), and tested on a large representative sample of students across Palestinian universities. We also recommend that a full Differential Item Functioning (DIF) analysis be conducted to determine whether any components of the test are biased toward certain demographic groups (Schmeiser & Welch, 2006). In addition to the larger study with students, we recommend a more detailed study of faculty attitudes toward the CAT. This would involve having faculty examine the test directly, receive training in how to score the test, and participate in a test scoring session. This step is particularly important because, if the CAT is to be used to assess critical thinking outcomes in Palestinian students, it must have face validity for faculty and be considered both practical and meaningful.
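To make the recommended DIF analysis concrete, the sketch below illustrates the Mantel-Haenszel procedure, one standard DIF method for dichotomously scored items. The counts are hypothetical, invented purely for illustration; an actual analysis would stratify real examinees by total score and compare, say, male and female response patterns within each stratum.

```python
import math

def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    Each stratum (an ability level, e.g. a total-score band) is given as
    (ref_correct, ref_wrong, focal_correct, focal_wrong).
    An odds ratio near 1 (log-odds near 0) suggests no DIF on the item.
    """
    num = den = 0.0
    for a, b, c, d in strata:        # a, b: reference group; c, d: focal group
        t = a + b + c + d
        num += a * d / t
        den += b * c / t
    return num / den

# Hypothetical counts for one item across three score bands (illustrative only)
strata = [
    (30, 10, 27, 13),   # low scorers
    (40, 5, 38, 7),     # middle scorers
    (20, 2, 19, 3),     # high scorers
]
odds = mantel_haenszel_dif(strata)
print(round(math.log(odds), 3))  # log-odds near 0 would indicate little DIF
```

In practice the log-odds is also tested for significance and mapped to effect-size categories before an item is flagged for revision.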
There is, moreover, a great deal of interest in developing critical thinking in students across the Middle East (Brewer et al., 2006; Al-Essa, 2009; Romanowski & Nassar, 2012). Therefore, an Arabic version of the CAT test is likely to have value well beyond Palestine as a means of measuring progress towards critical thinking goals at national, institutional and individual student levels.