Codebook LLMs: Evaluating LLMs as Measurement Tools for Political Science Concepts, by Andrew Halterman and Katherine A. Keith
Abstract: Codebooks, documents that operationalize concepts and outline annotation procedures, are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether “off-the-shelf” LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks (covering protest events, political violence, and manifestos) along with their unstructured texts and human labels. We also propose a five-stage framework for codebook-LLM measurement: preparing a codebook for both humans and LLMs, testing LLMs’ basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e., off-the-shelf performance), analyzing errors, and further (parameter-efficient) supervised training of LLMs. We provide an empirical demonstration of this framework using our three codebook datasets and several pretrained open-weight LLMs with 7-12 billion parameters. We find that current open-weight LLMs have limitations in following codebooks zero-shot, but that supervised instruction tuning can substantially improve performance. Rather than suggesting the “best” LLM, our contribution lies in our codebook datasets, evaluation framework, and guidance for applied researchers who wish to implement their own codebook-LLM measurement projects.
Submission history
From: Andrew Halterman
[v1] Mon, 15 Jul 2024 14:20:09 UTC (377 KB)
[v2] Thu, 9 Jan 2025 14:35:36 UTC (886 KB)