23 May
Yesterday, the AI startup Anthropic published a paper detailing the successful interpretation of the inner workings of a large language model (LLM). LLMs are notoriously opaque: their size, complexity, and numeric representation of human language have hitherto defied explanation, making it very difficult to understand why particular inputs lead to particular outputs. Anthropic used a technique called dictionary learning, leveraging a sparse autoencoder to isolate specific concepts within its Claude 3 Sonnet model. The technique allowed the team to extract millions of features, including specific entities like the Golden Gate Bridge as well as more abstract ideas such as gender bias. They…
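For readers wondering what "dictionary learning with a sparse autoencoder" looks like in practice, here is a minimal sketch in PyTorch. It trains a wide autoencoder to reconstruct model activations from a small number of active features, with an L1 penalty encouraging sparsity. The dimensions, the penalty coefficient, and the random placeholder data are my own illustrative assumptions, not details taken from Anthropic's paper.

```python
# Minimal sparse-autoencoder sketch for dictionary learning.
# Assumes you have a matrix of LLM activations; random data stands in here.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x: torch.Tensor):
        # Feature activations: a sparse, non-negative code for each input.
        features = torch.relu(self.encoder(x))
        # Reconstruction: a weighted sum of learned dictionary directions.
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the activations;
    # the L1 term pushes most feature activations toward zero (sparsity).
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

if __name__ == "__main__":
    sae = SparseAutoencoder(activation_dim=512, dict_size=4096)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    for step in range(100):
        x = torch.randn(64, 512)  # placeholder for real model activations
        recon, feats = sae(x)
        loss = loss_fn(x, recon, feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The dictionary is the decoder's weight matrix: each column is a direction in activation space, and the hope is that individual directions line up with human-interpretable concepts like the ones described above.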