Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song and 3 other authors
Abstract: Using Large Language Models (LLMs) as evaluators to assess the performance of other LLMs has attracted growing attention. However, such evaluations are susceptible to biases within the evaluating LLM, raising concerns about the accuracy and reliability of the results. To address this problem, we propose and study two many-shot In-Context Learning (ICL) prompt templates that help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). The former includes in-context examples accompanied by model-generated evaluation rationales as references, while the latter omits these rationales. Using these prompt designs, we investigate how increasing the number of in-context examples affects the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot and few-shot regimes. Furthermore, when GPT-4o is used as an evaluator in the many-shot regime, the MSwR template outperforms MSoR.
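To make the distinction between the two templates concrete, below is a minimal sketch of how an MSwR or MSoR prompt might be assembled. It assumes a simple dictionary-based example format; the function name, field names, and 1-10 scoring scale are illustrative assumptions, not the authors' actual implementation.

```python
def build_many_shot_prompt(examples, candidate, with_reference=True):
    """Assemble a many-shot evaluation prompt for an LLM judge (sketch).

    examples: list of dicts with keys 'instruction', 'response', 'score',
              and (for MSwR) 'rationale' -- a model-generated evaluation
              rationale used as a reference.
    candidate: dict with keys 'instruction' and 'response' to be scored.
    with_reference: True -> MSwR (rationales included as references),
                    False -> MSoR (rationales omitted).
    """
    parts = ["You are an impartial evaluator. Score each response from 1 to 10."]
    for i, ex in enumerate(examples, start=1):
        block = [
            f"### Example {i}",
            f"Instruction: {ex['instruction']}",
            f"Response: {ex['response']}",
        ]
        if with_reference:
            # MSwR: include the model-generated rationale as a reference.
            block.append(f"Rationale: {ex['rationale']}")
        block.append(f"Score: {ex['score']}")
        parts.append("\n".join(block))
    parts.append(
        "### Now evaluate\n"
        f"Instruction: {candidate['instruction']}\n"
        f"Response: {candidate['response']}\n"
        "Score:"
    )
    return "\n\n".join(parts)


# Example usage with hypothetical data: the same in-context examples
# yield either template depending on the with_reference flag.
demo = [{"instruction": "Summarize the article.", "response": "A short summary.",
         "rationale": "Covers all key points concisely.", "score": 9}]
query = {"instruction": "Summarize the article.", "response": "Another summary."}
print(build_many_shot_prompt(demo, query, with_reference=True))   # MSwR
print(build_many_shot_prompt(demo, query, with_reference=False))  # MSoR
```

In the many-shot regime, `examples` would contain dozens or hundreds of such demonstrations rather than the single one shown here.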
Submission history
From: Mingyang Song
[v1] Mon, 17 Jun 2024 15:11:58 UTC (221 KB)
[v2] Mon, 24 Jun 2024 16:02:21 UTC (235 KB)
[v3] Sun, 30 Jun 2024 13:31:24 UTC (221 KB)
[v4] Tue, 17 Sep 2024 14:04:27 UTC (312 KB)
[v5] Fri, 10 Jan 2025 13:23:37 UTC (708 KB)