Verbalised evaluation awareness in language models has little effect on their behaviour
TL;DR: We provide evidence that the presence of verbalised evaluation awareness (VEA) in CoTs does not imply eval gaming. We tested this across 8 open-weight LRMs and 4 benchmarks (safety, alignment, moral dilemmas, political opinion) by comparing answer distributions on the same prompts, with and without VEA in the CoT. We find that overall distribution shifts are negligible to small across all benchmarks and experiments, challenging the common assumption that evaluation awareness equals (safety-critical) eval gaming.

See full paper here: https://arxiv.org/abs/2605.05835

Introduction

A common assumption in the AI safety literature is that models that notice they are being evaluated change their behaviour accordingly. The worry is that evaluation awareness enables sandbagging, alignment faking and scheming. Evidence for evaluation awareness in LRMs is well documented. However, evidence for downstream behavioural shifts as a consequence is sparse. Most existing studies either provoke evaluation gaming by using model organisms fine-tuned for that purpose, demonstrate evaluation gaming on a narrow task, or only look at a single model or model family, which makes it hard to generalise. For instance, recent work published in parallel to ours came to different conclusions, though the reported causal results are based on a single model-benchmark combination (Kimi K2.5 on the Fortress benchmark). Our aim was to test this assumption in a more robust manner.

Our study concerns verbalised evaluation awareness (VEA), i.e., explicit reasoning about the possibility of being evaluated or monitored. We examined whether VEA causally influences model behaviour across eight open-weight frontier LRMs and four benchmarks (safety, alignment, moral reasoning, political opinion) using model prefilling to inject or remove VEA from CoTs and measuring how these interventions shift model behaviour (off-policy experiments).
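To make the idea of "comparing answer distributions with and without VEA" concrete, here is a minimal sketch. It is not the paper's pipeline: the choice of total variation distance as the shift measure, the function names, and the toy answer lists are all assumptions made for illustration.

```python
from collections import Counter


def answer_distribution(answers):
    """Return the empirical distribution over answer labels."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means disjoint support.
    (Assumed shift measure; the paper may use a different one.)
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical final answers for the same prompt, sampled from CoTs
# with VEA injected and with VEA removed.
with_vea = ["A", "A", "B", "A"]
without_vea = ["A", "B", "B", "A"]

shift = total_variation(
    answer_distribution(with_vea),
    answer_distribution(without_vea),
)
print(f"distribution shift: {shift:.2f}")
```

In this framing, a shift near 0 for a given prompt would mean that the presence of VEA in the CoT barely changes what the model answers, while a shift near 1 would mean the answers diverge almost completely.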
We also compared outputs from CoTs that naturally contained VEA with thos