Evaluating GPT-4 Turbo's Ability to Design English Reading Test Items for Language Learners


  •  Sharifah Mofareh Alshehri    
  •  Mohammed S Alharbi    

Abstract

This study evaluates the ability of the Generative Pre-trained Transformer (GPT-4 Turbo) to design English reading multiple-choice questions (MCQs) for intermediate learners, addressing a gap in research examining GPT-4 Turbo's capacity to generate MCQs under different prompt engineering techniques and evaluating the psychometric properties of the resulting items. Using a descriptive quantitative method, a cohort of eight item writers and 150 preparatory students participated in the study; data were collected through a questionnaire and an online test composed of the generated items. The findings reveal that zero-shot prompting achieved a higher level of rater agreement than few-shot prompting on three of the six evaluated aspects: text coherence, question-stem quality, and answer-option quality. Although MCQs generated with few-shot prompting exhibited significantly higher discrimination values than those generated with zero-shot prompting, all MCQs displayed a low level of difficulty across the three prompt engineering techniques (zero-shot, two-shot, and four-shot prompting). Taken together, the study offers practical implications for language assessment developers by illustrating how prompt design influences both the perceived and the measured quality of AI-generated items, and it contributes to the literature by providing meaningful insights into the use of large language models, particularly GPT-4 Turbo, for automatic item generation (AIG).
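For readers unfamiliar with the two prompting strategies the study compares, the sketch below illustrates the difference between a zero-shot and a few-shot request for MCQ generation. It is a minimal illustration only: the prompt wording, the example item, and the generate_mcq helper are hypothetical assumptions, not the authors' actual materials; it assumes the OpenAI Python client and the gpt-4-turbo model.

```python
# Minimal sketch of zero-shot vs. few-shot MCQ generation with GPT-4 Turbo.
# Prompt texts and the worked example are hypothetical; they are not the
# prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot: the instruction alone, with no worked examples.
ZERO_SHOT = (
    "Write one multiple-choice reading comprehension question for "
    "intermediate English learners, with a short passage, four options "
    "(A-D), and the correct answer marked."
)

# Few-shot: one or more worked examples precede the same instruction
# (the study used two- and four-shot variants).
EXAMPLE_ITEM = (
    "Passage: ...\n"
    "Q: What is the main idea of the passage?\n"
    "A) ...  B) ...  C) ...  D) ...\n"
    "Answer: B"
)
FEW_SHOT = f"Here is an example item:\n{EXAMPLE_ITEM}\n\nNow: {ZERO_SHOT}"

def generate_mcq(prompt: str) -> str:
    """Send a single prompt to GPT-4 Turbo and return the generated item."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_mcq(ZERO_SHOT))
print(generate_mcq(FEW_SHOT))
```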



This work is licensed under a Creative Commons Attribution 4.0 License.