Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

Authors: Mingyang Song et al.

Abstract: Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To alleviate this gap, in this paper, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval sub-tasks: searching and reasoning. Using Counting-Stars, we conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude 3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.
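Since the abstract describes the task design only at a high level, the Python sketch below illustrates how a counting-based, position-aware, multi-evidence instance of this kind could be assembled. It is a minimal illustration under stated assumptions, not the authors' released code: the evidence template ("The little penguin counted N ★."), the filler text, and the helper build_instance are all hypothetical choices made for this sketch.

```python
import random

# Hypothetical sketch (not the paper's released code) of assembling a
# Counting-Stars-style test instance: evidence sentences carrying a
# count are inserted at controlled relative positions within a long
# filler context, and an answer key records every count in order of
# appearance. The template, filler, and names are assumptions.

EVIDENCE_TEMPLATE = "The little penguin counted {n} ★."

def build_instance(filler, counts, positions):
    """Insert one evidence sentence per (count, relative-position)
    pair; return the long context and the ordered answer key."""
    assert len(counts) == len(positions)
    total = len(filler)
    context = filler
    # Insert from the rightmost position first so that earlier
    # character offsets remain valid after each insertion.
    for rel_pos, n in sorted(zip(positions, counts), reverse=True):
        offset = int(rel_pos * total)
        sentence = " " + EVIDENCE_TEMPLATE.format(n=n) + " "
        context = context[:offset] + sentence + context[offset:]
    # The answer key follows order of appearance in the context.
    answer = [n for _, n in sorted(zip(positions, counts))]
    return context, answer

if __name__ == "__main__":
    filler = "Lorem ipsum dolor sit amet. " * 2000   # stand-in long document
    counts = [random.randint(1, 64) for _ in range(8)]
    positions = [i / 8 for i in range(8)]            # spread evidence evenly
    context, answer = build_instance(filler, counts, positions)
    prompt = context + "\n\nList every number of ★ the little penguin counted, in order."
    print(f"context length: {len(context)} chars, answer key: {answer}")
```

Under these assumptions, scaling the filler length and the number of evidence sentences provides the length and difficulty knobs the abstract refers to, and scoring would compare the model's reported list against the answer key position by position.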

Submission history

From: Mingyang Song
[v1] Mon, 18 Mar 2024 14:01:45 UTC (929 KB)
[v2] Mon, 25 Mar 2024 14:58:41 UTC (845 KB)
[v3] Fri, 17 May 2024 16:58:23 UTC (366 KB)
[v4] Thu, 12 Dec 2024 02:45:29 UTC (367 KB)
[v5] Tue, 24 Dec 2024 01:41:28 UTC (937 KB)


