Intervening on Sparse, Anchored Concepts
Suppose you use mech interp to discover features in a model relating to deception. They're even causal: if you amplify them, the model displays more deceptive behavior; if you suppress them, it displays less. But how do you know you've found all such features? If you suppress them entirely or ablate the associated weights, have you removed the model's ability to be deceptive?

Earlier posts in this sequence presented proofs of concept for a new alignment technique that we call Sparse Concept Anchoring (SCA). This post summarizes our recent ICLR paper[1], in which we refined the technique and demonstrated its use for practical interventions on toy models. You don't need to have read the earlier posts to understand this one. Written with help from Claude Opus 4.7.

Our work is motivated by two observations.

The first is that understanding activations in an already-trained model is hard, because the training process shapes representation space to suit the task rather than to aid interpretation. Networks lose plasticity early in training[2], so the structure you are left analyzing may not even be optimal for the task, let alone for interpretation: concept directions may be entangled (inseparable from other concepts) or fragmented (multiple context-dependent representations of the same thing).

The second is that only a small fraction of the concepts and behaviors in a large model are directly safety-relevant: sparse autoencoders trained on Claude 3 Sonnet recovered millions of features, of which only a handful relate to safety.

These observations suggest a different approach. Rather than searching for safety-relevant concepts after training, we could reserve a place for them in advance and encourage the model to put them there, while leaving the rest of the representation space alone. Sparse Concept Anchoring does this with a combination of selective regularizers and rare labeled training samples.

How it works

The training objective (loss) in SCA has three terms: