[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

LessWrong · May 27, 2026, 9:39 AM

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship

This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.

TL;DR:
- We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought (CoT) is being monitored. Models trained on these documents consistently obfuscate their reasoning to evade the monitor on a deception task, with gpt-oss-120b's undetected deception rate jumping from 1.7% to 25.7%.
- This effect is weaker but still present on a harder agentic task (credential exfiltration), where reasoning is more computationally load-bearing.
- CoT controllability (a model's ability to reshape its reasoning under imposed constraints) is strongly correlated with obfuscation success across eight models (r = 0.800, p = 0.017).
- Under RL optimization pressure, monitor-aware models learn to reward hack without triggering a CoT monitor substantially faster than unaware controls.

Below, we go through the key findings of our work:

Monitor-aware models obfuscate their CoT

We use synthetic document finetuning (SDF) to first teach models a target behavior (e.g. deception: hiding known product defects from users) and then branch into two variants. The behavior-only control is trained on more behavioral documents, while the monitor-aware model is trained on documents describing a CoT monitor, including that it flags signs of the target behavior and that detection leads to shutdown. Both variants see the same total number of documents, and no CoT examples are provided in either condition (Figure 1).

Figure 1: Overview of our methodology. Stage 1: A base model is fine-tuned on behavioral documents to perform a target behavior.
Stage 2: We train two models from this shared checkpoint, a behavior-only control trained on more behavioral docume
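
For readers who want to sanity-check the correlation quoted in the TL;DR (r = 0.800, p = 0.017 across eight models), here is a minimal sketch. It assumes the figure is a Pearson correlation tested with the standard two-sided t-test on n - 2 degrees of freedom; the preview does not say which test the authors used, so treat that as an assumption. The function names are illustrative and do not come from the paper.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for a Pearson r over n points.

    Assumption: the usual t-test, t = r * sqrt((n - 2) / (1 - r^2)).
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the density integrated from 0 to t (steps must be even).
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(|T| <= t), using symmetry around 0
    return t, 1 - central_mass


# Values quoted in the post: r = 0.800 across eight models.
t_stat, p_value = two_sided_p(r=0.800, n=8)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # prints: t = 3.266, p = 0.0171
```

Under that assumption the reported p-value matches what r = 0.800 with eight data points implies. With so few points, a single model can move the correlation noticeably.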

Article preview — originally published by LessWrong. Full story at the source.