[paper] Training on Documents About Monitoring Leads to CoT Obfuscation
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship

This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.

TL;DR:
- We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents consistently obfuscate their reasoning to evade the monitor on a deception task, with gpt-oss-120b's undetected deception rate jumping from 1.7% to 25.7%.
- This effect is weaker but still present on a harder agentic task (credential exfiltration), where reasoning is more computationally load-bearing.
- CoT controllability (a model's ability to reshape its reasoning under imposed constraints) is strongly correlated with obfuscation success across eight models (r = 0.800, p = 0.017).
- Under RL optimization pressure, monitor-aware models learn to reward hack without triggering a CoT monitor substantially faster than unaware controls.

Below, we go through the key findings of our work:

Monitor-aware models obfuscate their CoT

We use synthetic document finetuning (SDF) to first teach models a target behavior (e.g. deception -- hiding known product defects from users) and then branch into two variants. The behavior-only control is trained on more behavioral documents, while the monitor-aware model is trained on documents describing a CoT monitor, including that it flags signs of the target behavior and that detection leads to shutdown. Both variants see the same total number of documents, and no CoT examples are provided in either condition (Figure 1).

Figure 1: Overview of our methodology. Stage 1: A base model is fine-tuned on behavioral documents to perform a target behavior.
Stage 2: We train two models from this shared checkpoint, a behavior-only control trained on more behavioral docume