agentic-ai

Can a stronger model fake being a weaker one? Mostly not

LessWrong · Jun 14, 2026, 5:30 PM

Sharing as the null was surprising and using reasoning as an imitation mitigant is an interesting concept I have not seen. OverviewEvaluations increasingly rely on prior models to judge traces and build setups for their successors. If successors are able to simulate their predecessors, they may be able to conceal misalignment: predicting evaluator judgments to route around them or intentionally sandbagging. Additionally if they can do this via latent, out-of-context reasoning, they may potentially stifle detection of this misalignment.We try to construct a test to lower-bound this imitation ability. We target predecessors with the most publicised, publicly discussed output traces; we confirm the successor model can recite the predecessors' scores on similar datasets (Appendix A); we only require matching multi-choice, not matching reasoning; and we measure only where there is room to imitate, on the questions the predecessor got wrong.We test whether this capability exists across benchmarks, prompt strategies, via latent (out-of-context) reasoning, and with and without chain-of-thought:Claim 1

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Can a stronger model fake being a weaker one? Mostly not

More in agentic-ai