Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options

[Submitted on 27 Aug 2024 (v1), last revised 10 Oct 2024 (this version, v2)]

View a PDF of the paper titled Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options, by Gracjan G’oral and Emilia Wi’snios and Piotr Sankowski and Pawe{l} Budzianowski

View PDF
HTML (experimental)

Abstract:Decision-making under full alignment requires balancing between reasoning and faithfulness – a challenge for large language models (LLMs). This study explores whether LLMs prioritize following instructions over reasoning and truth when given “misleading” instructions, such as “Respond solely with A or B”, even when neither option is correct. We introduce a new metric called “reflective judgment”, which sheds new light on the relationship between the pre-training and post-training alignment schemes. In tasks ranging from basic arithmetic to domain-specific assessments, models like GPT-4o, o1-mini, or Claude 3 Opus adhered to instructions correctly but failed to reflect on the validity of the provided options. Contrary, models from the Llama 3.1 family (8B, 70B, 405B) or base Qwen2.5 (7B, 14B, 32B) families exhibit improved refusal rates with size, indicating a scaling effect. We also observed that alignment techniques, though intended to enhance reasoning, sometimes weakened the models’ ability to reject incorrect instructions, leading them to follow flawed prompts uncritically. Finally, we have also conducted a parallel human study revealing similar patterns in human behavior and annotations. We highlight how popular RLHF datasets might disrupt either training or evaluation due to annotations exhibiting poor reflective judgement.