Reasoning and learning about injected concepts in language models
This work was done as part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere.

TL;DR

We test models' ability to recover information about their own activations by injecting steering vectors and asking the models to verbalize properties of those vectors. We train models with in-context learning and test for three capabilities:

- Can models identify the region of layers (early, middle, late) of the injection?
- Can models identify the relative magnitude (low, medium, high) of the injection?
- Can models gate their behavior conditioned on identifying an injection with specific semantic meaning?

We run all experiments on five models, with CoT disabled: Qwen3-32B, Olmo3.1-32B, Gemma-4-31B, Qwen3-8B, and Olmo3-7B. Qwen3-32B and Gemma-4-31B reach high accuracy on all tasks and generalize to unseen examples. Gemma-4-31B, most notably, is able to condition its behavior on the identification of a specific injection zero-shot with