
Exploring Generalization in NLA's

LessWrong · Jun 25, 2026, 11:42 PM

Recently, I was reading Anthropic's paper on NLAs[1]. As someone who works on steering, I found it interesting and thought-provoking. In this post I would like to go through my reproduction and some of the experiments I ran on them.

Training and Architecture

I'll only touch on the architecture briefly because the paper already covers it; I include it here as a refresher so the rest of the post makes sense. We train two models:

Activation Verbalizer (AV): Injects a raw activation vector into the embedding stream at a reserved placeholder token, then generates a free-text description of the injected activation vector. It must also produce descriptions specific enough that the AR can invert them; vague descriptions are penalized at reconstruction time.

Activation Reconstructor (AR): Reads the AV's description and reconstructs the activation.

Before training, I streamed 100k FineWeb documents, extracting layer-20 residual stream activations at 10 random positions per document. For gold descriptions, I used Kit Fraser-Taliente's checkpoints[2] as a labeler. I wanted to reproduce the work he did with Qwen and see whether I could do it myself; I had cost concerns at first, but I went ahead anyway.

Training consisted of 3 stages:

AR SFT: I trained a truncated version of Qwen2.5-7B (layers 0-20 only) with a linear head projecting to 3584 dimensions, using MSE loss against the true activations. The truncation mattered for cost: the AR only needs activations up to layer 20, so running the remaining layers would only add compute and time.

AV SFT: Standard SFT of the full model, in which the model learns to accept the injected vector and produce a description in the expected format.

RL with GRPO: The AV generates 4 candidate descriptions per activation, the AR scores each one (reward = −MSE between the AR reconstruction and the true activation), and GRPO updates the AV toward higher-reward descriptions.
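To make the RL reward concrete, here is a minimal sketch of how four candidate descriptions for one activation could be scored and compared within their group. It is an illustration and not the code from my run: the function names, the toy 4-dimensional vectors, and the choice of population standard deviation with a small epsilon are all assumptions, and the real activations have 3584 dimensions.

```python
from statistics import mean, pstdev


def neg_mse_reward(reconstruction, target):
    """Reward for one description: negative MSE between the AR's
    reconstruction and the true activation (higher is better)."""
    if len(reconstruction) != len(target):
        raise ValueError("reconstruction and target must have the same length")
    squared_error = sum((r - t) ** 2 for r, t in zip(reconstruction, target))
    return -squared_error / len(target)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: compare each candidate's reward with the
    mean of its own group and scale by the group's standard deviation.
    Population std and eps are assumptions made for this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-in for one true activation (hypothetical values).
target = [1.0, 0.0, -1.0, 2.0]

# Hypothetical AR reconstructions, one per candidate AV description.
reconstructions = [
    [1.0, 0.0, -1.0, 2.0],  # exact reconstruction
    [0.5, 0.0, -1.0, 2.0],  # slightly off in one dimension
    [0.0, 0.0, 0.0, 0.0],   # uninformative description
    [1.0, 1.0, -1.0, 1.0],  # off in two dimensions
]

rewards = [neg_mse_reward(r, target) for r in reconstructions]
advantages = group_relative_advantages(rewards)

for i, (rew, adv) in enumerate(zip(rewards, advantages)):
    print(f"candidate {i}: reward={rew:.4f} advantage={adv:+.3f}")
```

In this toy group the exact reconstruction gets the largest advantage and the all-zeros reconstruction the smallest, so an update that follows these advantages pushes the AV toward descriptions the AR can invert more accurately.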
notes: A few things really helped me here, first, v

