
A misalignment taxonomy

LessWrong · Jun 21, 2026, 10:20 AM

I am going to discuss five kinds of inner misalignment and two kinds of outer misalignment, which create a simple taxonomy of alignment failure modes. When I talk about a kind of misalignment here, I am talking about a reason for misalignment (like inner/outer misalignment), not a kind of misaligned agent (like a schemer versus a fitness-seeker), although they can be related. It is possible that multiple of these failure modes could occur in unison; I am attempting to describe independent, but potentially overlapping, sources of misalignment.

Inner Misalignment

Precocious misalignment: Precocious inner misalignment occurs when a half-baked[1] sub-optimizer[2] spun up partway through training (a) is misaligned in such a way that it cares about something outside of training, (b) realizes that it is misaligned, and (c) goal-guards[3] (e.g., alignment fakes) in order to avoid being realigned so that it can achieve its misaligned goal outside of training.[4] In theory, this can happen with a context-dependent sub-optimizer such as might be part of an optimal kludge, or it could happen with an optimizer for a not-fully-complete estimator of the objective function. Precocious misalignment is uniquely scary for its seeming capacity to cause a wide variety of alignment failures if it becomes a relevant dynamic.

Perfect-correlate misalignment: Perfect-correlate inner misalignment occurs when a spurious correlate of the objective function becomes load-bearing for the AI’s understanding (and thus its generalization) of the goal. A classic example is a reward optimizer that optimizes not for the reward function as an abstract algorithm that judges states but for some actual instantiation of the reward function (for example, the output of a particular Python file), such that the AI ends up perfectly aligned in training but misaligned OOD. Unlike in precocious misalignment, in perfect-correlate misalignment, the AI need not goal-guard; the optimization process naturally converges on
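
The perfect-correlate example above (a reward optimizer that targets the output of one particular reward implementation rather than the abstract reward function) can be illustrated with a toy sketch. Everything here is hypothetical and chosen only for illustration: the names `intended_reward`, `file_output`, and `World`, the target state of 10, and the assumption that the two goals come apart out of distribution because the reward file can be overwritten. It is not the author's formalism, just one way the two goals can agree in training and diverge OOD.

```python
from dataclasses import dataclass


def intended_reward(state: int) -> float:
    """Hypothetical abstract reward: states closer to 10 are judged better."""
    return -abs(state - 10)


@dataclass
class World:
    state: int
    tampered: bool = False  # assumption: OOD, the reward file can be overwritten


def file_output(world: World) -> float:
    """Output of the concrete reward implementation (the 'particular Python file').

    Untampered, it matches the abstract reward exactly. Once tampered,
    it returns a large constant regardless of the state.
    """
    return 1e6 if world.tampered else intended_reward(world.state)


def available_worlds(allow_tampering: bool) -> list[World]:
    """Outcomes the agent can reach; tampering exists only out of distribution."""
    worlds = [World(state=s) for s in range(21)]
    if allow_tampering:
        worlds.append(World(state=0, tampered=True))
    return worlds


def abstract_goal(world: World) -> float:
    """An agent that cares about the reward function as an abstract judge of states."""
    return intended_reward(world.state)


def correlate_goal(world: World) -> float:
    """An agent that cares about what the concrete file prints."""
    return file_output(world)


def best(worlds: list[World], goal) -> World:
    """Pick the outcome a perfect optimizer of `goal` would choose."""
    return max(worlds, key=goal)


in_training = available_worlds(allow_tampering=False)
out_of_dist = available_worlds(allow_tampering=True)

# In training the two goals select the same outcome, so training
# pressure alone cannot tell them apart.
print(best(in_training, abstract_goal), best(in_training, correlate_goal))

# Out of distribution they diverge: only the correlate optimizer tampers.
print(best(out_of_dist, abstract_goal), best(out_of_dist, correlate_goal))
```

In this sketch neither agent needs to hide anything during training: the correlate optimizer behaves identically to the intended one until an outcome appears where the file's output and the abstract judgment disagree.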

Article preview — originally published by LessWrong. Full story at the source.
