Interrater Scoring of Public Speaking Performances in an English Language Teacher Education Program

Grounded in constructivist learning principles, self-assessment has been the focus of many studies in the field of teacher education. Its importance and its contribution to learner empowerment have long been discussed. The current study moves one step further by adding a correlative comparison between instructors' and student teachers' grading, and by exploring students' views on self-assessment in the Oral Communication Skills Course at the English Language Teaching Department of a private university in Turkey. Interrater consistency was examined throughout the study. The study involved 21 student teachers who assessed their speaking performances five times using a micro-analytic rating scale. Both qualitative and quantitative methods were used in the data analysis. Both data sets suggest that there is a high correlation between instructor and student teacher grading. The study has implications for curriculum designers, instructors, and teacher candidates.


Introduction
Formative assessment methods have recently gained a voice in educational contexts as a result of the mainstreaming of constructivist psychology. The demands of the current era have shifted the role of education from transmitting limited knowledge in a specific domain to making students lifelong learners (Dochy, Segers, & Sluijsmans, 1999; Falchikov & Goldfinch, 2000). In doing so, self-assessment helps learners to become reflective practitioners (Schön, 1982) and makes them active participants in their own learning (Falchikov & Goldfinch, 2000). Though self-assessment sounds promising for learning, a broad range of literature draws attention to the validity and accuracy of this method. Studies indicate that the accuracy of self-assessment depends on experience in assessment (Falchikov & Goldfinch, 2000) and on having clear assessment criteria (Sadler & Good, 2006). In the absence of experience, the problem in self-assessment is the inflated grades given by learners who are unskilled in assessment (Kruger & Dunning, 1999). Some studies (Ross, 1998; Ward, Gruppen, & Regehr, 2002) also indicate that students' self-assessment lacked reliability when compared to external measures, one of which is teacher assessment. For this reason, examining consistencies and inconsistencies between students' self-assessment and teacher assessment has been one of the popular areas of research in recent years (Barrot, 2015; Ünaldı, 2016; Baleghizadeh & Hajizadeh, 2014).

Self-Assessment and Teacher Assessment
It is well known that, with the popularity of alternative assessment and constructivism in language teaching, self-assessment has become an important component of language classroom assessment techniques. From the perspective of Bailey (1998), self-assessment could be defined as "procedures by which the learners themselves evaluate their language skills and knowledge". In a similar way, Mousavi (2012) defines self-assessment as "an individual's own evaluation of his/her language ability and this evaluation are generally based on how good the individual is at specific language skills or how well the individual can use different styles of the language". Self-assessment offers language learners many advantages. By being involved in self-assessment, learners can "monitor their learning process, become active learners, develop their metacognitive knowledge, enhance their own learning, develop a better understanding of the purpose of the assignment and the assignment criteria, increase their motivation and involvement, take responsibility for their own learning, and think critically" (Searby & Ewers, 1997; Sluijsmans, Dochy, & Moerkerke, 1999; Allam, 2004; Rourke, 2013; Orsmond & Merry, 1997; AlFallay, 2004; Butler & Lee, 2010; Black, 2009; Brown & Hudson, 1998). Despite all the advantages listed above, self-assessment could be perceived as an assessment technique which does not get the attention it deserves. The reasons why students are not given much chance to evaluate themselves include "students' tendency to overestimate or underestimate their performances as compared to teacher assessment, students' lack of assessment skills, students' tendency to perform the assessment based on potential rather than actual ability" (Lee, 2016; Brown & Hudson, 1998; Karnilowicz, 2012; Lew, Alwis, & Schmidt, 2010). Teacher assessment as opposed to self-assessment "was accompanied in students' minds by instructors, since they are the authorized persons to make decisions about students' progress and achievement" (Thawabieh, 2017). Louis and Harada (2012) present a comparison of teacher and self-assessment as follows:

Table 1. Comparison between students' self-assessment and teacher-assessment
Teacher focus | Student focus
Teacher tells and student listens | Teachers and students are co-learners
Teacher uses summative assessment | Teacher and student together use formative assessment
Teacher is uncertain of the student ability to assess his work | Teacher believes that self-assessment is a learnable skill.
Many researchers have conducted studies comparing self-assessment and teacher assessment.

Studies on Teacher Assessment and Students' Self-Assessment
Many studies have investigated the correlation between teacher assessment and self-assessment in various fields of education. Since the aim of this paper is to focus on interrater consistency between students' self-grades and teacher grades in assessing the public speaking performances of student teachers enrolled in a department of English language teaching, the related literature reviewed here comprises studies on teacher assessment and self-assessment of language learners' speaking performances. Lee (2016) compared self-assessment and teachers' assessment. The participants were teachers and students of a Korean-English program. The study concluded that students' self-grades were similar to the teachers' evaluation; however, there were no similarities in the content of the teacher evaluation and the students' assessment. Dlaska and Krekeler (2008) examined students' ability to accurately assess their own pronunciation skills in comparison with their teachers' assessment and concluded that students had difficulty in assessing their pronunciation skills. In another study, Lundquist, Momary, and Rogers (2013) compared students' self-assessment of their communication skills with that of their teachers. The findings showed that students' self-assessment scores were lower than the scores given by the teachers. Chen (2008) compared students' self-assessment and teacher assessment on the basis of a training program designed for students. The findings indicated that, as a result of the training program, students' self-assessment scores displayed a higher correlation with teacher assessment towards the end of the training. Tavakoli (2010) explored the correlation between self-assessment and teacher assessment in a speaking test and concluded that the correlation between students' self-ratings and teacher ratings was high.

Interrater Reliability in the Context of Self-Assessment and Teacher Assessment
Interrater reliability could be defined as "one type of internal consistency measure that differs from scale reliability in that it evaluates the level of agreement among raters versus the reliability of the assessment itself" (Porter & Jelinek, 2011). There are certain factors that affect interrater reliability, and these include "rater training, rater selection, accountability for accurate rating, rubric design, type of rubric scale, and pilot programs and redesign" (Graham, Milanowski, & Miller, 2012). As for the interrater reliability between self-assessment and teacher assessment, Barrot (2015) lists the factors that affect interrater reliability as follows:
• Type of language skill
• Rating scale
Interrater reliability has been investigated in some studies examining the correlation between self-assessment and teacher assessment (AlFallay, 2004; Butler & Lee, 2006; Chang, Tseng, & Lou, 2012, in Barrot, 2015). Despite numerous studies on the comparison between self-assessment and teacher assessment (Chang, Tseng, & Lou, 2012; Karnilowicz, 2012; Ünaldı, 2016; Tavakoli, 2010; Lee, 2016), studies on interrater reliability between self-assessment and teacher assessment of speaking skills seem scarce in the Turkish context, where English is the medium of instruction in most educational institutions. There seems to be a gap in exploring the correlation between students' self-assessment and teachers' assessment in English speaking classes in Turkey. This study attempts to fill this gap by examining interrater consistency between students' self-grades and teacher grades in assessing the public speaking performances of student teachers enrolled in a department of English language teaching. The present study also differs from previous similar studies on interrater consistency in that it examines the differences and/or similarities between students' self-assessment and teacher assessment with regard to who chooses the presentation topic. To provide in-depth perspectives on the issue and contribute to the related literature on Turkish studies, this study examines the following research questions:
1) Is there any interrater consistency between students' self-grades and teacher grades in assessing public speaking performances of student teachers over a five-week period?
2) Are there any differences and/or similarities between students' self-assessment and teacher assessment on the basis of who chooses the topic of the presentation?
3) What are the reflections of student teachers with regard to self-grading?

Context of the Study
The present study was conducted in the English Language Teaching Department of a private university in Istanbul, Turkey. The data were collected during the Oral Communication Skills Course over a five-week period.

Participants of the Study
The participants of the current study were 21 teacher candidates, all freshmen, who assessed their own speaking performances five times using a micro-analytic rating scale. Five of the participants were male and the rest were female. At the time of the study, the participants were taking the Oral Communication Skills Course and the Advanced Reading and Writing Course. Their linguistic backgrounds were assumed to be similar since they had all passed the Proficiency Exam prepared and administered by the English Preparatory Program of the university in which they were enrolled. All the participants took part in the training session designed for using the micro-analytic rating scale for their oral performances. The participants' teacher, who was teaching the Oral Communication Skills Course, also took part in the study.

Data Collection Tools
The data for the current study were collected through a micro-analytic rating scale designed for speaking performances (Barrot, 2015) and students' reflection papers. The rating scale consisted of 27 items, 11 of which were content-related and the rest delivery-related. There were two versions of the scale: one designed for the teacher and one designed for students. All the items were the same in both versions except for the point of view used. The scale was designed as a 5-point Likert scale (see Appendix). The second data collection tool was the reflection papers that students wrote upon completion of each oral performance and its assessment. Participants were told to reflect on their self-assessment and the teacher assessment after each presentation session. They were also told that there was no limit on the time they could spend writing the reflection papers.
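To make the structure of the scale concrete, the sketch below shows one way the 27 Likert items could be aggregated into content, delivery, and total scores. It assumes, purely for illustration, that the 11 content items come first on the form and that each item is scored from 1 to 5; the item order, the function name `score_scale`, and the sample responses are hypothetical and are not taken from the instrument itself.

```python
# Minimal sketch of scoring a 27-item, 5-point rating scale.
# Assumption: items 1-11 are content-related and items 12-27 are
# delivery-related; the real ordering of the scale may differ.

N_ITEMS = 27
N_CONTENT = 11  # the remaining 16 items are delivery-related


def score_scale(responses):
    """Return content, delivery, and total scores for one completed form."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    content = sum(responses[:N_CONTENT])
    delivery = sum(responses[N_CONTENT:])
    return {"content": content, "delivery": delivery, "total": content + delivery}


# Hypothetical form: every content item rated 4, every delivery item rated 3.
example = score_scale([4] * 11 + [3] * 16)
print(example)
```

Under these assumptions the content score can range from 11 to 55, the delivery score from 16 to 80, and the total from 27 to 135.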

Procedure
Before data were collected through students' in-class presentations, all the participants attended a training session during which they were informed about both the purpose of the study and the rating scale they were to use after their oral performances in class. A detailed introduction to the assessment rubric was presented by the researcher. Each of the components was explained, and the first step of the training session lasted until the researcher was sure that every item in the scale was clear. As the second step, all the participants watched a video-taped presentation and assessed the performance of the speaker on the basis of the rating scale to be used in the study. The rationale behind this task was to make sure that the participants had no difficulty in assessing a speaking performance. The discussion about the assessment of the demo presentation lasted until participants had no questions. After the training session, a list of presentation topics was given to all participants, and each participant chose the presentation topic for each week. To examine differences and/or similarities between students' self-assessment and teacher assessment regarding who chooses the topic, each participant chose three presentation topics on his/her own and the other two presentation topics were assigned by the teacher. To collect data through students' oral performances, each participant was given 15 minutes to present his/her topic in an actual classroom setting. Teacher assessment with the rating scale was carried out during each student's performance. Each student was asked to assess his/her own performance using the rating scale. The performance of each participant was video-taped, and the student could assess his/her oral performance either right after the presentation or at home. The same procedure continued for a 5-week period, during which all the participants delivered five different oral performances. To explore participants' feelings and opinions about their self-assessment, they were asked to write reflection papers at home for each oral performance they delivered in class. At the end of the 5-week period, all the rating scales completed by the participants and the reflection papers were collected and compared with the teacher's assessment of the oral performances made with the same rating scale.

Data Analysis
Data for the current study were analyzed both quantitatively and qualitatively. The quantitative data obtained through the micro-analytic rating scale were analyzed with Pearson correlation analysis along with descriptive analysis, using SPSS 15.0 for Windows. Thematic analysis was used to analyze the qualitative data collected from students' reflection papers.
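As an illustration of the correlation step, the following is a minimal sketch of the Pearson product-moment coefficient computed in plain Python rather than in SPSS. The function name `pearson_r` and the score lists are hypothetical and are not the study's data; the sketch only shows how paired student and teacher scores could be correlated.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical total scores for five presentations (not study data).
student_totals = [93, 101, 88, 110, 97]
teacher_totals = [97, 104, 90, 108, 99]
r = pearson_r(student_totals, teacher_totals)
print(f"r = {r:.3f}")
```

A coefficient close to 1 would indicate that students who rated themselves higher were also rated higher by the teacher; it says nothing about whether the two sets of scores share the same mean, which is why a separate test of mean differences is also needed.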

Quantitative Data Analysis
Analysis of the rating scale: Prior to the parametric data analysis, normality assumptions were checked with the Shapiro-Wilk test and Q-Q plots, and the data were found to be normally distributed (p > .05). Outlier values were checked with the help of skewness and kurtosis calculations, and outliers were excluded from further analysis with the use of z values. The data were analyzed with the paired samples t-test in order to assess the difference between the student and teacher ratings of presentation performance. Each week's student and teacher scores on a content, delivery, and total basis were paired. There were 21 pairs on this basis (see Table 2). According to the results of the paired samples t-test in Table 2, there was no significant difference between the teacher and student ratings in the weekly comparisons (p > .01). While this finding is in line with some studies which concluded that there was consistency between students' self-assessment and teacher assessment (Chen, 2008; Butler & Lee, 2006; Tavakoli, 2010; Chang, Tseng, & Lou, 2012; Barrot, 2015), it contradicts the findings of studies in which inconsistency was found between students' self-assessment and teacher assessment (Ross, 1998; Kruger & Dunning, 1999; Dlaska & Krekeler, 2008; Lundquist, Momary, & Rodgers, 2013). The findings of the current study also revealed that the ratings of both the students and the teacher differed significantly between the first week and the fifth week on content, delivery, and total rating. More specifically, students' ratings in the first week were significantly different from their ratings in the fifth week on content (t(17) = -3.57, p < .003), delivery (t(17) = -5.36, p < .001), and total ratings (t(17) = -5.01, p < .001). Accordingly, students rated their fifth week's presentations significantly higher on content (M 1st = 39.28, M 5th = 44.50), delivery (M 1st = 54, M 5th = 64.56), and total ratings (M 1st = 93.28, M 5th = 109.06). Similarly, the teacher's ratings in the first week were significantly different from the ratings in the fifth week on content (t(17) = -6.12, p < .001), delivery (t(17) = -3.57, p < .003), and total ratings (t(17) = -5.12, p < .001). Specifically, the teacher rated students' fifth week's presentations significantly higher on content (M 1st = 38.44, M 5th = 44.72), delivery (M 1st = 58.61, M 5th = 63.72), and total ratings (M 1st = 97.06, M 5th = 108.44).
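The week-to-week comparisons above rest on the paired samples t-test. The sketch below, written in plain Python with hypothetical scores rather than the study's data, shows how the t statistic and its degrees of freedom follow from the pairwise differences; the function name `paired_t` is an illustrative assumption, and the p-value (which requires the t distribution) is deliberately left out.

```python
import math


def paired_t(first, second):
    """Return (t, df) for a paired-samples t-test on two matched lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two matched lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Hypothetical week 1 and week 5 content scores for six students (not study data).
week1 = [38, 40, 36, 41, 39, 37]
week5 = [44, 45, 41, 46, 43, 44]
t, df = paired_t(week1, week5)
print(f"t({df}) = {t:.2f}")
```

Because the differences are taken as week 1 minus week 5, higher fifth-week scores give a negative t, matching the sign of the values reported above. The degrees of freedom equal the number of pairs minus one, so a result reported as t(17) corresponds to 18 pairs entering the test.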
In addition, on the basis of task type (two types: presentation topic chosen by the student or by the teacher), the ratings of the students differed significantly depending on who chose the topic of the presentation (see Table 3).

Qualitative Data Analysis
Analysis of participants' reflection papers: The reflection papers in which participants expressed their views on self-assessment after each presentation were analyzed via thematic analysis (Creswell, 2007). Thematic analysis is a useful method for investigating different participants' perspectives by presenting similarities and differences as well as unanticipated insights. In addition, thematic analysis helps the researcher summarize important features of a large data set (Braun & Clarke, 2006; King, 2004). In analyzing the data via thematic analysis, the phases below were followed (Nowell et al., 2017):
1) Familiarizing with data
2) Generating initial codes
3) Searching for themes
4) Reviewing themes
5) Defining and naming themes
To support the validity and reliability of the themes and related responses, the reflection papers were also analyzed by the second researcher following the same procedures. Upon completion of the analysis of participants' views, carried out jointly by the two researchers, the emerging themes and sample responses were listed. Table 4 displays the themes and a sample response for each theme. The analysis of participants' views on self-assessment yielded ten general themes under which participants' responses were grouped (see Table 4). Table 4 shows that the themes fall into two categories: themes with a positive connotation (help for future career, fun, personal development, better understanding of assessment, motivation, and self-confidence) and themes with a negative connotation (stressful, lack of training, objectivity, and time consuming). The number of themes that imply participants' positive feelings about self-assessment is higher than the number that imply participants' discomfort with and dislike of assessing themselves. The findings of the current study are in line with two groups of previous studies, which (1) imply positive attitudes of students towards self-assessment (Stefani, 1994; Shahrakipour, 2012; Siow, 2015; Hanrahan & Isaacs, 2018) and (2) underline students' negative attitudes towards assessing themselves (Falchikov, 1986; Munoz & Alvarez, 2007; Hanrahan & Isaacs, 2018). We can infer from the findings of the present study that, from the perspective of the participants, self-assessment is a valuable technique as it helps students with their personal development, self-confidence, and future career. In addition, participants find self-assessment an efficient tool that leads to a better understanding of the assessment process. In contrast to these positive effects, the participants believe that being involved in assessing their own performances may cause stress and make it difficult to be objective. The participants also stressed that the lack of a training program makes it difficult for some to assess their own performances, an activity that is also viewed as time-consuming.

Discussion
The aim of the current study was to examine the interrater consistency between students' self-grades and teacher grades in assessing public speaking performances of student teachers over a five-week period. The study also investigated the differences and/or similarities between students' self-assessment and teacher assessment on the basis of who chooses the topic of the presentation. Last but not least, students' views regarding self-assessment were explored. The findings reveal that there was no difference between the teacher and student ratings in the weekly comparisons, which could be attributed to the assessment training or to the scale used in the assessment process. In addition, the results revealed that the ratings of both the participating students and the teacher differed significantly between the first week and the fifth week on content, delivery, and total rating. More specifically, students' ratings in the first week were significantly different from their ratings in the fifth week on content, delivery, and total ratings. Accordingly, students rated their fifth week's presentations significantly higher on content, delivery, and total ratings. Similarly, the teacher's ratings in the first week were significantly different from the ratings in the fifth week on content, delivery, and total ratings. Specifically, the teacher rated students' fifth week's presentations significantly higher on content, delivery, and total ratings. The increase in the ratings of both the students and the teacher could stem from becoming more familiar with the assessment process over the five-week period. In addition, on the basis of task type (two types: presentation topic chosen by the student or by the teacher), the ratings of the students differed significantly depending on who chose the topic of the presentation. More specifically, students' ratings differed significantly on content, delivery, and total in terms of who chose the topic. Accordingly, students rated the presentations whose topics they had chosen themselves higher on all dimensions: content, delivery, and total. The teacher's ratings, on the other hand, differed significantly on content and total in terms of who chose the topic. Accordingly, the teacher also rated the presentations whose topics were chosen by the students themselves higher on content and total. Such findings could be attributed to the possibility that students feel more motivated and perform better when they choose the presentation topics. As for the views of the participating students on self-assessment, the findings revealed that while some students have a positive attitude towards being involved in self-assessment, others are of the opinion that self-assessment is disadvantageous as it requires special training and objectivity and could be time-consuming and stressful.
The current study, however, has some limitations that need to be considered. One limitation that may have affected the results, specifically the findings on consistency between students' self-assessment and teacher assessment, is that students received training for only a very short time.
Other factors that may have influenced the findings could be the rating scale, students' background, language skill, and the presentation topics. Given these limitations, further studies should consider these factors from a wider perspective. In addition, further research could be conducted over a longer period with a larger group of participants to reach more generalizable results.
The current study also has some implications that foreign language teachers should consider in designing their oral communication skills courses. As the study has revealed, self-assessment is viewed as a useful tool, though it may be disadvantageous for some students. Since disadvantages such as being time-consuming and the difficulty of being objective could be overcome by some training, self-assessment as well as teacher assessment should be implemented in the teaching/learning process in order to raise awareness of the learning process.

Table 2. Paired samples statistics of student and teacher ratings of the rubric

Table 3. Paired samples statistics of student and teacher ratings of the rubric in terms of chosen subject

Table 4. Participants' views about self-assessment
Theme | Sample response
Help for future career | As a teacher candidate, having a chance to evaluate myself is really helpful. I feel as if I am the teacher.