Spinning the Golden Thread: Benchmarking Long-Form Generation in long-context LLMs



Authors: Yuhao Wu and 2 other authors


Abstract: The abilities of long-context language models (LMs) are often evaluated using the “Needle-in-a-Haystack” (NIAH) test, which comprises tasks designed to assess a model’s ability to identify specific information (the “needle”) within large text sequences (the “haystack”). While these benchmarks measure how well models understand long input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models’ ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
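To make the idea of "incorporating required events" concrete, here is a minimal Python sketch of how such a check could be scored. It is an illustration only and not the paper's evaluation pipeline: the `Constraint` class, the `completion_rate` function, the split of the output into sections, and the case-insensitive substring match are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    """Hypothetical requirement: `event` must appear in section `section`."""
    section: int  # 0-based index of the section that must contain the event
    event: str    # phrase expected somewhere in that section


def completion_rate(sections: list[str], constraints: list[Constraint]) -> float:
    """Return the fraction of constraints whose event appears in its section.

    Matching is a case-insensitive substring test, which is an assumption
    made for this sketch; a real benchmark may use a stricter or
    model-based judge.
    """
    if not constraints:
        return 1.0
    met = 0
    for c in constraints:
        # A constraint on a section the model never wrote counts as unmet.
        if c.section < len(sections) and c.event.lower() in sections[c.section].lower():
            met += 1
    return met / len(constraints)


# Toy example: three generated sections and three required events.
sections = [
    "Week 1: moved into the new flat.",
    "Week 2: quiet week at work.",
    "Week 3: attended a wedding in Lisbon.",
]
constraints = [
    Constraint(section=0, event="moved into the new flat"),
    Constraint(section=1, event="ran a marathon"),
    Constraint(section=2, event="wedding in Lisbon"),
]
print(completion_rate(sections, constraints))
```

In this toy input, the second constraint is not satisfied, so the score falls below 1. Scoring per section rather than over the whole text is a design choice in this sketch: it penalizes an event that appears in the wrong place, which is one plausible way to test instruction following over long outputs.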

Submission history

From: Yuhao Wu
[v1] Tue, 3 Sep 2024 17:25:54 UTC (851 KB)
[v2] Tue, 10 Sep 2024 02:43:36 UTC (850 KB)




By stp2y
