Evaluating Offline Monitoring of Internal AI Agents

LessWrong · Jun 28, 2026, 1:20 PM

This work was conducted during the Gov AI Winter Fellowship 2026.

Executive Summary

Frontier AI companies use offline monitoring to address risks from internally deployed AI agents. AI developers increasingly rely on AI agents for internal work, including for safety research and model training. At the same time, these companies are concerned that a misaligned model could exploit this access to take concerning actions, such as sabotaging efforts to understand the risks posed by AI. To identify such instances, AI companies have separate AI models called "monitors" that review transcripts of AI agents' actions and flag suspicious activity. Human reviewers examine activity flagged as suspicious by monitors, judge whether that activity is concerning, and decide on an appropriate response. This monitoring occurs offline, meaning that actions are reviewed after they have been executed rather than intercepted in real time.

Companies currently assess the effectiveness of offline monitoring via synthetic attacks. To assess the effectiveness of offline monitoring, OpenAI and Anthropic use synthetic attacks – transcripts constructed to contain the kind of harmful actions a misaligned AI might take during deployment – and then check whether monitors flag them.

Current reporting on assessments of effectiveness is insufficient. Given the information currently made public by Anthropic and OpenAI, external parties cannot assess the overall effectiveness of their offline monitoring. Anthropic reports how often monitors flag synthetic attack transcripts, but not how often those transcripts survive the full pipeline and reach human review; later filters can discard transcripts that were initially flagged. OpenAI reports that their monitor flags all transcripts independently flagged by human employees, but this tells us little if employees only catch obvious cases of misbehavior.
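The gap between a monitor's flag rate and the rate at which attacks actually reach human review can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Transcript` fields, the two helper functions, and the counts are hypothetical and are not taken from any company's reporting or pipeline.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """One synthetic attack transcript and its fate (hypothetical fields)."""
    flagged_by_monitor: bool
    kept_by_later_filters: bool  # only meaningful if the monitor flagged it


def monitor_flag_rate(transcripts):
    """Fraction of synthetic attacks that the monitor flags."""
    return sum(t.flagged_by_monitor for t in transcripts) / len(transcripts)


def human_review_rate(transcripts):
    """Fraction of synthetic attacks that are flagged AND survive later filters."""
    reached = sum(
        t.flagged_by_monitor and t.kept_by_later_filters for t in transcripts
    )
    return reached / len(transcripts)


# Made-up numbers for illustration only:
# 10 attacks, 9 flagged by the monitor, 3 of those discarded by later filters.
attacks = (
    [Transcript(True, True)] * 6
    + [Transcript(True, False)] * 3
    + [Transcript(False, False)] * 1
)

print(f"monitor flag rate:    {monitor_flag_rate(attacks):.0%}")   # 90%
print(f"reaches human review: {human_review_rate(attacks):.0%}")   # 60%
```

In this invented example, a reported flag rate of 90% would overstate how many attacks a human reviewer ever sees, which is the distinction the summary draws between monitor-level and pipeline-level reporting.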
As far as we are aware, only Anthropic and OpenAI have published information about their offline moni

Article preview — originally published by LessWrong. Full story at the source.