Evaluating Synthetic Activations composed of SAE Latents in GPT-2



Authors: Giorgi Giglemiani and 4 other authors


Abstract: Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model's activations at an early layer results in a step-function-like change in the model's final-layer activations, and that the model's sensitivity to this perturbation differs between model-generated (real) activations and random activations. In our study, we use this sensitivity measure to compare real activations to synthetic activations composed of SAE latents. Our findings indicate that synthetic activations closely resemble real activations when we control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple "bag of SAE latents" lacking internal structure, and that SAE latents possess significant geometric and statistical properties. Notably, our synthetic activations exhibit less pronounced activation plateaus than those typically surrounding real activations.
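The abstract combines two procedures: composing a synthetic activation as a sparse sum of SAE decoder directions, and sweeping a perturbation of an early-layer activation while measuring the resulting change at the final layer. The sketch below is a minimal illustration of both under stated assumptions, not the authors' implementation: `sae_decoder`, `synthetic_activation`, `sensitivity_curve`, and `model_tail` are hypothetical placeholders, and a real experiment would use a trained SAE and run GPT-2 from the perturbed layer to the final layer.

```python
# Minimal sketch (not the paper's code) of the two measurements described
# in the abstract. All names here are hypothetical placeholders.
import torch

d_model = 768          # GPT-2 small residual stream width
n_latents = 24576      # hypothetical SAE dictionary size

# Hypothetical SAE decoder: each row is one latent's unit direction in the
# residual stream. In practice this comes from a trained sparse autoencoder.
sae_decoder = torch.randn(n_latents, d_model)
sae_decoder = sae_decoder / sae_decoder.norm(dim=-1, keepdim=True)

def synthetic_activation(k, coeffs=None):
    """Compose a 'bag of SAE latents': sample k latent directions and sum
    them with positive coefficients. Fixing k controls sparsity; one could
    additionally resample until the pairwise cosine similarities of the
    chosen latents match those observed in real activations."""
    idx = torch.randperm(n_latents)[:k]
    if coeffs is None:
        coeffs = torch.rand(k)  # placeholder magnitudes
    return coeffs @ sae_decoder[idx]  # (k,) @ (k, d_model) -> (d_model,)

def sensitivity_curve(model_tail, base_act, direction, n_steps=50):
    """Perturb an early-layer activation along `direction` with increasing
    magnitude and record the L2 change in final-layer activations.
    `model_tail(act)` is assumed to run the model from the perturbed layer
    to the final layer. Real activations show a plateau followed by a
    step-function-like jump; the comparison asks whether synthetic
    activations produce the same plateau."""
    out0 = model_tail(base_act)
    dists = []
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        out = model_tail(base_act + alpha * direction)
        dists.append((out - out0).norm().item())
    return torch.tensor(dists)

# Example with a dummy tail, purely for illustration:
tail = lambda a: torch.tanh(a)  # stand-in for running layers L..final
act = synthetic_activation(k=32)
curve = sensitivity_curve(tail, act, direction=torch.randn(d_model))
```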

Submission history

From: Stefan Heimersheim
[v1] Mon, 23 Sep 2024 13:46:38 UTC (1,302 KB)
[v2] Mon, 18 Nov 2024 10:35:37 UTC (1,302 KB)


