
A review of “Investigating the consequences of accidentally grading CoT during RL”

LessWrong · May 7, 2026, 6:06 PM

Last week, OpenAI staff shared an early draft of "Investigating the consequences of accidentally grading CoT during RL" with Redwood Research staff.

To start with, I appreciate them publishing this post. I think it is valuable for AI companies to be transparent about problems like these when they arise. I particularly appreciate them sharing the post with us early, discussing the issues in detail, and modifying it to address our most important criticisms.

I think it will be increasingly important for AI companies to have a policy of getting external feedback on the risks posed by their deployments, and in particular to have some external accountability on whether they have adequate evidence to support their claims about the level of risk posed; as an example of this, see METR reviewing Anthropic's Sabotage Risk Report. We at Redwood Research are interested in participating in this kind of external review of evidence about safety, so I am taking this as an opportunity to try out writing this kind of review. If you work at a frontier AI company, please feel free to reach out if you'd like our review of similar documents.

My overall assessment is that I mostly agree with the analysis in the post.

(Thanks to many Redwood and OpenAI staff for discussion that substantially informed my commentary; on the Redwood side, Alexa Pan and Ryan Greenblatt contributed substantially to this review.)

Assessing the evidence that CoT training did not damage monitorability

OpenAI's blog post contains bad news and good news. The bad news is that, in some of the RL training for recent OpenAI models, graders (that is, LLMs that read transcripts and produce scores which inform the reward assigned to the relevant trajectory) had access to the chains of thought of the policies being trained. (For context, Anthropic has also accidentally graded CoTs in their training runs; Mythos, Opus 4.6, and Sonnet 4.6 had CoTs exposed in 8% of RL episodes.)
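Neither post includes code, but the failure mode is easy to sketch. The following is a minimal, entirely hypothetical illustration (the segment schema, function name, and flag are my own invention, not OpenAI's pipeline): if the step that assembles the grader's input forgets to strip reasoning segments, the grader scores the CoT along with the visible answer, so the reward signal can pressure the CoT itself.

```python
# Hypothetical sketch of how a grader can accidentally see the policy's
# chain of thought during RL. A trajectory mixes hidden "reasoning"
# segments (the CoT) with the visible "answer" segment.

def build_grader_input(trajectory, strip_cot=True):
    """Concatenate trajectory segments into the text shown to the grader.

    If strip_cot is False (the bug), reasoning segments leak into the
    grader's context, so the reward can depend on the CoT's contents.
    """
    parts = []
    for segment in trajectory:
        if segment["type"] == "reasoning" and strip_cot:
            continue  # CoT is meant to be invisible to the grader
        parts.append(segment["text"])
    return "\n".join(parts)

trajectory = [
    {"type": "reasoning", "text": "(CoT) I should sound confident to score well."},
    {"type": "answer", "text": "The capital of France is Paris."},
]

safe_input = build_grader_input(trajectory)                     # answer only
leaky_input = build_grader_input(trajectory, strip_cot=False)   # CoT included
```

The point of the sketch is that the bug lives in plumbing, not in the grader model: a single missing filter turns a CoT-blind reward into one that optimizes the chain of thought directly.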
They discovered this as a result of building a s

Article preview — originally published by LessWrong. Full story at the source.