The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models

arXiv:2411.18924v1 Announce Type: new
Abstract: This study investigates the impact of example selection on the performance of automated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of the GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regression analysis is used to quantitatively assess the biases introduced by example selection. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more strongly influenced by the examples than GPT-4. We also find evidence in GPT-generated essay scores and QWK of majority label bias (a tendency to favor the majority label among the examples) and recency bias (a tendency to favor the label of the most recent example), with both biases more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models evaluated, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially for GPT-3.5 models, and highlight the need for individual performance evaluations of each model version, even for minor revisions.
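For readers unfamiliar with the agreement metric, QWK can be computed with scikit-learn's cohen_kappa_score. The sketch below is an illustration, not code from the paper, and the score arrays are hypothetical placeholders rather than data from the study.

```python
# Minimal sketch (not code from the paper): computing quadratic weighted
# kappa (QWK) between human rater scores and GPT-assigned essay scores
# with scikit-learn. The score arrays below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 4, 1, 3, 2, 4]  # hypothetical human rater scores
gpt_scores = [2, 3, 3, 4, 2, 3, 2, 4]    # hypothetical GPT-generated scores

qwk = cohen_kappa_score(human_scores, gpt_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 = perfect agreement; 0 = chance-level agreement
```

QWK penalizes disagreements quadratically in the distance between scores, so a prediction two points away from the human score costs more than two separate one-point misses, which makes it a common choice for ordinal essay-scoring scales.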




By stp2y
