[Paper] Dictionary Learning Identifiability for Understanding SAEs

LessWrong · Jun 5, 2026, 12:39 AM

Brief Summary

Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like feature-splitting, feature-absorption, or encoding dense features. Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors.

In this work I analysed dictionary learning (which SAEs approximate) to examine when and why these effects occur (a similar motivation to multiple previous efforts). I present some general-purpose theoretical approaches that I found useful in understanding these phenomena. Further, having identified a failure mode of SAEs, these tools could be applied to other optimisation problems to see if they behave better, leading to better concept extraction techniques (though this paper doesn't pursue this).

Briefly, the technical contribution is the following. I study the dictionary learning optimisation problem and, following excellent earlier work, reformulate it in various ways, including showing that the problem is convex in the wide-dictionary limit. One thing we definitely know about SAE representations is that they are local optima of the SAE optimisation problem. For a point to be a local optimum, there must be no perturbation away from it that decreases the loss to first order. As in earlier work, I use this to derive first-order optimality conditions, which place interpretable constraints on how features and residuals are allowed to relate to one another in an optimal solution. A solution that breaks these constraints cannot be a local optimum. For example, this prohibits the existence of hierarchically related features. Finally, towards the end of the paper I consider the wide-dictionary limit and show that it can explain some findings about, for example, dense features.

The paper can be found here. I welcome any comments or criticism from this community, as I am not yet a fluent mechanistic interpretability speaker!

Illustrative Example

TL;DR: I

Article preview — originally published by LessWrong. Full story at the source.
