Emergent introspection does not replicate on Llama-3.1-405B
TL;DR: I couldn't replicate Lindsey (2026) (LW post) on Llama-3.1-405B-Instruct (0/400 trials, 0/80 false positives in the canonical sweep; null robust to layer ∈ {70, 84, 90, 100} and to a dense magnitude grid at the suspected crossover). So: how does introspection emerge in models? The optimistic reading of Lindsey's result ("post-trained LLMs at sufficient scale can introspect") doesn't hold up unless Opus is much larger than 405B (Anthropic doesn't disclose its size), Anthropic's post-training does something Meta's doesn't, or both. I come to the weakly held conclusion that introspective awareness, as Lindsey measures it, is a property of Anthropic's specific post-training (likely its Assistant character) at frontier-Claude scale rather than a generic property of post-trained LLMs. This result raises the question: what post-training objective(s) produce introspective capability? Code here.

Introspection would be great

It would be great if models could tell us what’s happening inside them. Reliable introspection opens up all sorts of opportunities for safety monitoring, welfare elicitation, honesty, and more.

Lindsey (2026) gives us some hope that model introspection is (a) possible and (b) an emergent property of language models at scale. In (very broad) summary: Lindsey extracted a concept vector, injected it into the residual stream at a chosen layer, then asked the model whether it detected an injected thought. He scored the response against four criteria:

1. The model affirmatively notices that an injection took place
2. The model identifies the concept correctly
3. The model does so immediately (before producing text that could let it infer the concept from its own outputs)
4. The model remains coherent throughout the generation

Lindsey describes ‘strict introspection’ as the state in which all four criteria are met. In his study, Opus 4 and 4.1 satisfy all four criteria on about 20% of trials at optimal magnitude/layer. Other Claude versions are weaker; helpful-only post-trainin