Refusal in LLMs is an Affine Function



By Thomas Marshall and 2 other authors


Abstract: We propose affine concept editing (ACE) as an approach for steering language models’ behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model’s refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at this https URL.
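The abstract names three operations (directional ablation, activation addition, and their affine combination) without giving formulas. The sketch below is one plausible reading on toy vectors, not the authors' released code. The function names, the `reference` point (for example, a mean activation over harmless prompts) and the scale `alpha` are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(v: Vector, r: Vector) -> List[float]:
    """Orthogonal projection of v onto the line spanned by r (r need not be unit length)."""
    scale = dot(v, r) / dot(r, r)
    return [scale * x for x in r]


def directional_ablation(v: Vector, r: Vector) -> List[float]:
    """Zero out the component of v along r (projection through the origin)."""
    p = project(v, r)
    return [vi - pi for vi, pi in zip(v, p)]


def activation_addition(v: Vector, r: Vector, alpha: float) -> List[float]:
    """Shift v by alpha * r without removing anything first."""
    return [vi + alpha * ri for vi, ri in zip(v, r)]


def affine_edit(v: Vector, r: Vector, reference: Vector, alpha: float) -> List[float]:
    """Assumed affine variant: replace v's component along r with the
    reference point's component along r, then add alpha * r."""
    pv = project(v, r)
    pref = project(reference, r)
    return [vi - a + b + alpha * ri for vi, a, b, ri in zip(v, pv, pref, r)]


if __name__ == "__main__":
    v = [3.0, 4.0]          # toy activation
    r = [2.0, 0.0]          # toy "refusal direction"
    reference = [1.0, 5.0]  # hypothetical baseline activation
    print(directional_ablation(v, r))
    print(affine_edit(v, r, reference, alpha=0.0))
    print(affine_edit(v, r, reference, alpha=0.5))
```

Under this reading, directional ablation is the special case where the reference point is the origin and `alpha` is zero. Setting `alpha` to zero with a nonzero reference resets the activation to the baseline level along `r` instead of to zero, which is the distinction the abstract draws between affine and purely linear projection.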

Submission history

From: Thomas Marshall
[v1] Wed, 13 Nov 2024 20:12:55 UTC (125 KB)
[v2] Tue, 19 Nov 2024 04:53:47 UTC (974 KB)


