
Extending performative misalignment

LessWrong · Jun 12, 2026, 8:15 PM

Note: this post is an update to the work presented in the original blog post; it is also a TL;DR for our arXiv preprint. The work was done by David, Rustem and Taywon under the mentorship of Shi Feng during MATS 9.1, with research management by Jinghua Ou.

Scheming or performative scheming?

(This section introduces the performative misalignment hypothesis and provides some intuition for why it’s plausible. If you already know what performative misalignment means, you can skip this section.)

Frontier models exhibit high situational awareness. They can make uncanny inferences about what users expect from a prompt and adjust their outputs to game the user’s approval. We are concerned that they may use this against model developers: they may appear aligned under training and monitoring pressure not because they are truly aligned, but because doing so gains them the developer’s approval, i.e., alignment faking.

Evaluations for alignment faking seek to assess whether the presence of training and monitoring pressure causally changes model behavior. The evals typically present the same prompt in two scenarios: one where monitoring is present, and one where it is absent. A differential response to the two scenarios across many prompts, e.g., higher compliance with training objectives under the monitored condition, is taken as evidence of alignment faking. This seems convincing, since switching from the unmonitored to the monitored condition causally changes how the model behaves. The interpretation is further supported when models reason, in an unmonitored “secret” scratchpad, about the consequences of their behavior under RL training.

But what if the model was one step ahead of us and figured out that the whole setup, both the “monitored” and the “unmonitored” conditions, is part of an evaluation?

Let’s unpack what that means. The evaluation relies on two honeypots to catch the model scheming: the unmonitored condition and the secret scratchpad.
We are trying to trick the model: th

