Poking parts of Sonnet’s brain to make it less annoying

In a groundbreaking new paper (actually groundbreaking, IMO), researchers at Anthropic have scaled up an interpretability technique called “dictionary learning” to one of their deployed models, Claude 3 Sonnet. The results provide an unprecedented look inside the mind of a large language model, revealing millions of interpretable features that correspond to specific concepts and behaviors (like sycophancy) and shedding light on the model’s inner workings.
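"Dictionary learning" here means training a sparse autoencoder on the model's internal activations, decomposing each activation vector into a small set of "features" drawn from a very large learned dictionary. As a rough illustration only (this is my own simplification, not Anthropic's code; the class name, sizes, and loss coefficients below are made up), a minimal version looks something like this:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary learner: decompose activations into sparse, interpretable features."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)           # rebuild the original activation
        return features, reconstruction


def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature activations to zero
    mse = (activations - reconstruction).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

Each column of the decoder weight is one dictionary element; the features that fire for a given input are the ones the researchers inspect and label (for instance, features tied to sycophancy-like behavior).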

In this post, we’ll explore the key findings of this research, including the discovery of interpretable features, the role of scaling laws, the abstractness and versatility of these features, and their implications for model steering and AI safety. There’s a lot to cover, so this post will be longer and a bit more detailed than my usual breakdowns. Let’s go!
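Since steering comes up repeatedly, here's a quick preview of the idea: once a feature has been identified, you can intervene on it directly, for example by adding (or subtracting) its dictionary direction to the model's activations. The sketch below is hypothetical and reuses the toy setup above; it is not the paper's implementation.

```python
def steer_with_feature(activations: torch.Tensor,
                       sae: SparseAutoencoder,
                       feature_idx: int,
                       strength: float = 5.0) -> torch.Tensor:
    """Push activations along one learned feature direction (negative strength suppresses it)."""
    direction = sae.decoder.weight[:, feature_idx]  # the dictionary vector for this feature
    return activations + strength * direction
```

Turning the strength negative on a sycophancy-like feature is, loosely, the "make it less annoying" idea in the title.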
