In a groundbreaking new paper (actually groundbreaking, IMO), researchers at Anthropic have scaled up an interpretability technique called “dictionary learning” to one of their deployed models, Claude 3 Sonnet. The results provide an unprecedented look inside the mind of a large language model, revealing millions of interpretable features that correspond to specific concepts and behaviors (like sycophancy) and shedding light on the model’s inner workings.
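If "dictionary learning" is new to you, here's a minimal sketch of what it looks like in this setting: a sparse autoencoder trained to reconstruct a model's internal activations as sparse combinations of learned "feature" directions. To be clear, this is my own illustrative toy, not Anthropic's implementation; the layer sizes, sparsity penalty, and training loop are all made up for the example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: the core object in dictionary learning.

    It reconstructs model activations as a sparse combination of learned
    "feature" directions (the dictionary). All dimensions and hyperparameters
    here are illustrative, not taken from the paper.
    """
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction from the dictionary
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero,
    # so each activation is explained by only a handful of features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

if __name__ == "__main__":
    d_model, n_features = 512, 8192  # many more features than dimensions ("overcomplete")
    sae = SparseAutoencoder(d_model, n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    acts = torch.randn(64, d_model)  # stand-in for activations captured from a real model
    for _ in range(10):
        x_hat, f = sae(acts)
        loss = loss_fn(acts, x_hat, f)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In practice you'd train something like this on huge numbers of activations captured from the model itself, then inspect which inputs make each learned feature fire. That inspection step is where the interpretable concepts show up.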
In this post, we’ll explore the key findings of this research, including the discovery of interpretable features, the role of scaling laws, the abstractness and versatility of these features, and their implications for model steering and AI safety. There’s a lot to cover, so this post will be longer and a bit more detailed than my usual breakdowns. Let’s go!