Reasoning and learning about injected concepts in language models

LessWrong · Jun 24, 2026, 4:39 AM · Also reported by 1 other source

This work was done as part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere.

TL;DR: We test models' ability to recover information about their activations by injecting steering vectors and asking the LLMs to verbalize properties of them. We train models with in-context learning and test for three capabilities: (1) Can models identify the region of layers (early, middle, late) of the injection? (2) Can models identify the relative magnitude (low, medium, high) of the injection? (3) Can models gate their behavior conditioned on identifying an injection with specific semantic meaning?

We run all experiments on five models, with CoT disabled: Qwen3-32B, Olmo3.1-32B, Gemma-4-31B, Qwen3-8B, and Olmo3-7B. Qwen3-32B and Gemma-4-31B have high accuracy on all tasks and generalize to unseen examples. Gemma-4-31B, most notably, is able to condition its behavior on the identification of a specific injection zero-shot with …
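To make the setup concrete, here is a minimal toy sketch of what "injecting a steering vector" at a given layer and magnitude could look like. It is not the authors' code: the additive form of the injection, the split of the layer stack into equal thirds, and the names `layer_region` and `inject` are all assumptions made for illustration, and the preview does not say how the regions or magnitude levels are actually defined.

```python
import math


def layer_region(layer_idx, n_layers):
    """Label a layer as 'early', 'middle', or 'late'.

    Assumption: the stack is split into three equal thirds. The source
    does not specify the actual boundaries.
    """
    if not 0 <= layer_idx < n_layers:
        raise ValueError("layer_idx out of range")
    third = n_layers / 3
    if layer_idx < third:
        return "early"
    if layer_idx < 2 * third:
        return "middle"
    return "late"


def inject(hidden, steering, alpha):
    """Return hidden + alpha * (steering / ||steering||).

    Assumption: the injection is a plain additive shift of one hidden
    state by a unit-normalized steering direction scaled by alpha.
    """
    norm = math.sqrt(sum(s * s for s in steering))
    if norm == 0:
        raise ValueError("steering vector must be non-zero")
    return [h + alpha * s / norm for h, s in zip(hidden, steering)]


# Hypothetical usage: a 64-layer model, injecting at layer 32.
region = layer_region(32, 64)
steered = inject([0.0, 0.0], [3.0, 4.0], alpha=5.0)
```

In a real model the same shift would typically be applied to the hidden state at the chosen layer during the forward pass. The tasks described above then ask whether the model can report the region and the relative size of `alpha` from its own activations.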

Article preview — originally published by LessWrong. Full story at the source.
