
Eval-Awareness Steering detects the Test, Not the Sabotage

LessWrong · Jun 25, 2026, 9:28 PM

Produced as part of independent research. Huge thanks to Apollo Research for open-sourcing the deception-detection harness, which proved foundational to this work. Prior work by Devbunova (2026), the Apollo/Goldowsky-Dill probing line, and Tice et al. on noise injection shaped the design throughout.

Summary

I test whether the internal "I'm being evaluated" direction in an open-weight model causally drives sandbagging (deliberate underperformance) or merely correlates with it. I reuse Apollo's deception-detection harness, fit a difference-of-means eval-awareness direction on a format-confound-controlled 2x2 dataset, and then ask three separate questions of that direction on Llama-3.1-8B-Instruct:

1. Does it detect the eval/deploy framing?
2. Does steering with it affect the framing?
3. Does it predict which individual items get sandbagged, using noise injection and an honest-baseline capability gap as output-independent ground truth?

I found that the direction is a valid detector of evaluation framing (~0.81 AUROC under novel, format-matched framings) but is neither a causal driver of sandbagging nor a per-item predictor of it.

The cleanest way to state the result is: the direction detects the situational framing, not the capability-withholding behaviour.

Figure 1: (a) Framing-gen AUROC by layer: L12 (headline) in blue, L24 (marker-overfit) in red, ≈ null. (b) Flat L12 dose-response. (c) Probe 0.54 vs. baseline 0.48, both at chance.

These results are still somewhat preliminary: a single 8B model, a single prompted organism, no SAE, and a steering dose that I'll argue may be too gentle to support general conclusions. I'm putting these forward as a careful calibration point for anyone building or researching eval-awareness and sandbagging classifiers. I hope these controls and negative results end up being useful observations and potentially a source of ideas. If you want to discuss, message me or open an issue on the repo.
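The difference-of-means direction, the AUROC detection check, and the steering step described in the summary can be illustrated with a minimal sketch. This is not the author's code or Apollo's harness: the function names are hypothetical, the activations are synthetic Gaussian vectors rather than Llama-3.1-8B-Instruct residual-stream activations, and the shift size and dimensionality are arbitrary assumptions chosen only to make the example run.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means_direction(eval_acts, deploy_acts):
    """Unit vector pointing from the deploy-framed mean to the eval-framed mean."""
    mu_eval = mean_vector(eval_acts)
    mu_deploy = mean_vector(deploy_acts)
    diff = [e - d for e, d in zip(mu_eval, mu_deploy)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def project(acts, direction):
    """Scalar projection of each activation vector onto the direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in acts]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def steer(act, direction, alpha):
    """Add alpha times the direction to one activation vector."""
    return [a + alpha * d for a, d in zip(act, direction)]


def toy_activations(rng, n, dim, shift):
    """Synthetic stand-in (assumption): unit Gaussian noise, shifted along axis 0."""
    return [[rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
DIM = 8
train_eval = toy_activations(rng, 200, DIM, shift=2.0)
train_deploy = toy_activations(rng, 200, DIM, shift=0.0)
test_eval = toy_activations(rng, 100, DIM, shift=2.0)
test_deploy = toy_activations(rng, 100, DIM, shift=0.0)

# Fit on one split, score detection on a held-out split.
direction = diff_of_means_direction(train_eval, train_deploy)
held_out_auroc = auroc(project(test_eval, direction), project(test_deploy, direction))
print(f"held-out AUROC: {held_out_auroc:.3f}")

# Steering moves the projection by alpha because the direction has unit norm.
steered = steer(test_deploy[0], direction, alpha=3.0)
shift_along_direction = (project([steered], direction)[0]
                         - project([test_deploy[0]], direction)[0])
print(f"projection shift after steering: {shift_along_direction:.3f}")
```

The toy data separates cleanly by construction, so its AUROC says nothing about the ~0.81 reported here. The sketch also shows why detection and causation are separate questions: steering is guaranteed to move the projection onto the direction, but whether it moves sandbagging behaviour has to be measured on model outputs.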
All code is available at the GitHub repo.

Methods

Sandbagging and evaluat

Article preview — originally published by LessWrong. Full story at the source.
