SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning



arXiv:2412.04628v1 Announce Type: new
Abstract: We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO) designed to accommodate multiple dynamically chosen positive and negative responses for each query. SWEPO employs a weighted group contrastive loss, assigning weights to responses based on their deviation from the mean reward score. This approach effectively prioritizes responses that are significantly better or worse than the average, enhancing optimization. Our theoretical analysis demonstrates that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment. Additionally, we provide insights into the training dynamics of our loss function and a related function, InfoNCA. Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations using the AlpacaEval dataset.



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.