One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks, by Fangru Lin and 9 other authors
Abstract: Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation, and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects in canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce ReDial (Reasoning with Dialect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including the GPT, Claude, Llama, Mistral, and Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for relevant future research. Code and data can be accessed at this https URL.
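To make the parallel-pair evaluation concrete, here is a minimal sketch of the kind of fairness comparison the abstract describes: querying a model with both the Standardized English (SE) and AAVE version of each item and measuring the accuracy gap. The field names, the ask_model callable, and the data layout are illustrative assumptions, not the actual ReDial schema or evaluation code.

# Hypothetical sketch of a dialect-fairness check over parallel query pairs.
# Assumes each pair is a dict with 'se', 'aave', and 'answer' keys; these
# names are placeholders, not the ReDial format.

def accuracy_gap(pairs, ask_model):
    """pairs: list of dicts with 'se', 'aave', and 'answer' keys.
    ask_model: callable mapping a query string to the model's answer."""
    se_correct = aave_correct = 0
    for pair in pairs:
        se_correct += ask_model(pair["se"]) == pair["answer"]
        aave_correct += ask_model(pair["aave"]) == pair["answer"]
    n = len(pairs)
    # A positive gap means the model serves AAVE speakers worse.
    return se_correct / n - aave_correct / n

# Toy usage with a stub "model" that only parses the SE phrasing:
pairs = [{"se": "What is 2 + 2?", "aave": "What 2 plus 2 be?", "answer": "4"}]
gap = accuracy_gap(pairs, lambda q: "4" if q.startswith("What is") else "?")
print(f"SE - AAVE accuracy gap: {gap:.2f}")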
Submission history
From: Fangru Lin
[v1] Mon, 14 Oct 2024 18:44:23 UTC (2,891 KB)
[v2] Tue, 14 Jan 2025 09:52:50 UTC (4,659 KB)