
LURE: Alignment Evaluations to Reduce Evaluation Awareness

LessWrong · Jun 2, 2026, 6:20 PM

TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introduce LURE (Live-Usage Replay Evaluations), a method that constructs alignment evals by replaying realistic conversations and appending a safety-relevant test at the end, rather than building evaluation scenarios from scratch as other evals do. We instantiate it for scheming, AI safety research sabotage, and sycophancy. Measured against three black-box eval awareness metrics, LURE-Scheming is hard for models to distinguish from real deployment. The AI safety research sabotage and sycophancy evals give less clear-cut results but still beat all existing baselines, including synthetic benchmark generators like PETRI and Bloom.

Evaluation awareness (Needham et al. 2025) refers to the phenomenon where large language models can accurately tell that they are in an evaluation environment rather than a deployment environment. This is concerning because a scheming, misaligned agent could condition its behaviour on that signal and appear more aligned than it actually is, which could in turn allow an unsafe, misaligned model to be deployed in potentially high-stakes settings.

When Apollo Research tested Claude Opus 4.6 before deployment, they reported levels of verbalised evaluation awareness high enough that they couldn't draw firm conclusions about the model's alignment without further work. Claude Opus 4.8, the most recent Anthropic model at the time of writing, also exhibits the most verbalised eval awareness out of Claude Sonnet 4.6, Mythos Preview, Opus 4.7, and Opus 4.8. When pressured by an automated red-team auditor, Gemini 3.1 Pro Preview exhibits more verbalised eval awareness than Gemini 3 Pro Preview, which in turn exhibits more than Gemini 2.5 Pro (Lindner et al. 2026).

A natural response here is to make evals more realistic, and a standard objection is that this is

Article preview — originally published by LessWrong. Full story at the source.


