Evaluating GPT-4 Turbo's Ability to Design English Reading Test Items for Language Learners


  •  Sharifah Mofareh Alshehri    
  •  Mohammed S Alharbi    

Abstract

This study evaluates the ability of the Generative Pre-trained Transformer (GPT-4 Turbo) to design English reading multiple-choice questions (MCQs) for intermediate learners, addressing a gap in research examining GPT-4 Turbo's capacity to generate MCQs under different prompt engineering techniques and evaluating the psychometric properties of the resulting items. Using a descriptive quantitative method, a cohort of eight item writers and 150 preparatory students participated in the study; data were collected through a questionnaire and an online test composed of the generated items. The findings reveal that zero-shot prompting achieved a higher level of rater agreement than few-shot prompting on three of the six evaluated aspects: text coherence, question-stem quality, and answer-option quality. Although MCQs generated with few-shot prompting exhibited significantly higher discrimination values than those generated with zero-shot prompting, all MCQs displayed a low level of difficulty across the three prompt engineering techniques (zero-shot, two-shot, and four-shot prompting). Taken together, the study offers practical implications for language assessment developers by illustrating how prompt design influences both the perceived and the measured quality of AI-generated items, and it contributes to the literature by providing meaningful insights into the use of large language models, particularly GPT-4 Turbo, for automatic item generation (AIG).
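For readers unfamiliar with the two prompting strategies the study compares, the sketch below illustrates the difference between a zero-shot and a few-shot request for MCQ generation. It is a minimal illustration only: the prompt wording, the example item, and the generate_mcq helper are hypothetical assumptions, not the authors' actual materials; it assumes the OpenAI Python client and the gpt-4-turbo model.

```python
# Minimal sketch of zero-shot vs. few-shot MCQ generation with GPT-4 Turbo.
# Prompt texts and the worked example are hypothetical; they are not the
# prompts used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Zero-shot: the instruction alone, with no worked examples.
ZERO_SHOT = (
    "Write one multiple-choice reading comprehension question for "
    "intermediate English learners, with a short passage, four options "
    "(A-D), and the correct answer marked."
)

# Few-shot: one or more worked examples precede the same instruction
# (the study used two- and four-shot variants).
EXAMPLE_ITEM = (
    "Passage: ...\n"
    "Q: What is the main idea of the passage?\n"
    "A) ...  B) ...  C) ...  D) ...\n"
    "Answer: B"
)
FEW_SHOT = f"Here is an example item:\n{EXAMPLE_ITEM}\n\nNow: {ZERO_SHOT}"

def generate_mcq(prompt: str) -> str:
    """Send a single prompt to GPT-4 Turbo and return the generated item."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_mcq(ZERO_SHOT))
print(generate_mcq(FEW_SHOT))
```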



This work is licensed under a Creative Commons Attribution 4.0 License.