The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models

By Yanjun Chen and 5 other authors

Abstract: Reinforcement Learning from Human Feedback (RLHF) significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of the reward models used during training. This study asks whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and it opens new avenues for research into the key factors driving model performance and how to choose the most suitable reward models. Code and additional details are available at this https URL.
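
To make the experimental knob concrete, here is a minimal, hypothetical sketch (not the authors' code, which fine-tunes language models against Longformer-based reward models on QA-FEEDBACK): it trains a toy softmax policy with a REINFORCE-style policy gradient against reward functions whose agreement with a hidden "true" quality score is controlled by an accuracy parameter. All names, constants, and the noise model below are illustrative assumptions.

```python
# Toy illustration (assumed setup, not the paper's pipeline): vary reward-model
# accuracy while holding the policy-optimization recipe fixed.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

NUM_RESPONSES = 8                         # candidate responses for one fixed prompt
TRUE_QUALITY = torch.rand(NUM_RESPONSES)  # hidden "human" preference scores (illustrative)

def noisy_reward_model(accuracy: float):
    """Return a reward function whose agreement with TRUE_QUALITY is set by
    `accuracy` in [0, 1]; lower accuracy mixes in more random noise."""
    noise = torch.rand(NUM_RESPONSES)
    scores = accuracy * TRUE_QUALITY + (1.0 - accuracy) * noise
    return lambda idx: scores[idx]

def train_policy(reward_fn, steps: int = 2000, lr: float = 0.1):
    """REINFORCE on a softmax policy over the candidate responses."""
    logits = torch.zeros(NUM_RESPONSES, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=0)
        idx = torch.multinomial(probs, 1).item()   # sample a response
        reward = reward_fn(idx)                    # score it with the reward model
        loss = -torch.log(probs[idx]) * reward     # policy-gradient loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.softmax(logits.detach(), dim=0)

# Compare "moderate" vs "high" reward-model accuracy under the same training loop.
for acc in (0.5, 0.8, 1.0):
    policy = train_policy(noisy_reward_model(acc))
    best = policy.argmax().item()
    print(f"reward-model accuracy {acc:.1f}: policy favours response {best} "
          f"(true quality {TRUE_QUALITY[best].item():.2f})")
```

Varying `accuracy` here is the analogue of swapping weaker or stronger reward models into an otherwise unchanged training loop, which is the comparison the paper performs at full scale.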

Submission history

From: Yanjun Chen
[v1]
Wed, 9 Oct 2024 05:17:08 UTC (1,399 KB)
[v2]
Wed, 16 Oct 2024 04:48:08 UTC (1,399 KB)


