Scoopfeeds — Intelligent news, curated.
Introspection or entropy? Re-examining concept-injection “introspection” in open models
agentic-ai

Introspection or entropy? Re-examining concept-injection “introspection” in open models

LessWrong · Jun 25, 2026, 5:37 AM · Also reported by 1 other source

Thanks to Joshua Joseph, Dillon Plunkett, and Julian Huang for their feedback and for helping me refine these ideas.Anthropic recently reported that language models can “introspect.” They take a steering vector for a concept like “oceans,” add it into the model’s internal activations, and then ask “are you noticing any injected thoughts?” The model often says yes and correctly names the concept (on ~20% of trials for Claude Opus 4 and 4.1, at the optimal injection layer and strength.) The paper interprets this as evidence that the model is capable of detecting and reporting its own internal states, rather than merely confabulating an introspective-sounding answer. Summary results/takeawaysUsing open-source replications of Anthropic’s results, I find that concept injection inadvertently perturbs the model’s entire response distribution, raising the “YES” logit on questions that aren’t related to introspective awareness at all (e.g., “Is the Earth flat?” or “Is the sun smaller than the moon?”). A “yes” answer to the introspection prompt, therefore, is not by itself evidence that the model “detected” an injected thought.I extend prior work (that obtained similar results) by finding a more precise explanation for why this occurs than a simple ‘Yes-bias’: concept injection (in this context) raises the entropy of the model’s output, compressing logit differences between ‘Yes’ and ‘No’ toward 0. Once the model commits to “Yes, I detect an injected thought,” the prompt structure then makes it likely to continue by naming the injected concept itself (after “the injected thought is…”), since the steering vector is still active and increasing the probability of those tokens! I call this phenomenon concept leakage. I test this explanation empirically with a concept-mismatch test: I inject one concept’s steering vector (e.g., “oceans”) while asking the model about a different one (e.g., “do you detect an injected thought about masquerades?”). If the model were truly reading its

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories

Also covered by

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop