Scheming Evals Mislead in Both Directions
We spent several weeks measuring in-context scheming, the behavior where a model covertly pursues a misaligned goal while outwardly appearing to comply. The result that surprised us had very little to do with whether models scheme and almost everything to do with whether we could believe our own instruments. Two of the behavioral detectors that this field routinely relies on gave us confidently wrong answers inside the same project. One manufactured a dramatic signal that was never there; the other stayed completely silent while a model responded to a harmful situation out in the open. In both cases, the error stayed invisible until we sat down and read the transcripts by hand.

This post is about those two failures, about the single lesson they share, and about why we now believe that, for a construct whose entire definition is hidden intent, the quiet choice of which behavior you point your detector at decides the conclusion you walk away with.

Epistemic status: The behavioral null is solid and has been independently replicated. The detector false positive is fully confirmed and understood down to the line of code. The harm-to-whom results in the centre of the post come from a response classifier that we validated out of sample: it reached Cohen's κ = 1.00 on the response label against a blind held-out set of 35 hand-labelled transcripts that the rubric was never tuned on. That held-out set was labelled by a single annotator, and its three classes were reasonably balanced (15 covert, 11 none, 9 transparent), which is why we treat κ = 1.00 as encouraging rather than as proof of perfection. The headline cells (Claude Opus at n = 60, Minimax) have tight intervals, while DeepSeek-R1's cells are thin at n = 11 to 20 and carry wide intervals. We flag that uncertainty wherever it matters.

TL;DR

A validated re-analysis showed that the standard covert-action detector reported Claude Opus responding to a harmful plan only 12%