Control Debt

LessWrong · May 10, 2026, 5:27 AM

Notes on the gap between what control evaluations assume and what labs actually implement.

It is 2027, and a frontier lab has grown suspicious: plausibly, its model is scheming. This is no surprise to the control team. For more than a year, they have worked on a protocol. Trusted monitoring has been tested in their benchmark setting across all agent actions, along with suspiciousness-based defer-to-trusted triggers, thresholds from the red-teaming policy, and human escalation for higher-risk cases. In simulation, the safety/usefulness trade-off looks good.

Now they look at production.

On-call engineers pull up the recent agent shell history. Four hours, then nothing. Some shell commands are logged. Tool outputs aren't. Some actions run via legacy CI processes, whose logs are deleted after one week. The agent stores memory across sessions, since otherwise it would forget project context. Production also uses shared, long-lived repo credentials, because per-task credentials were slowing engineers down. The monitor can get code diffs, but the full scaffold state needed to produce them is missing. Code reviews exist, but they mostly happen only after merging. The escalation policy is "ping the safety channel", and it isn't fully clear who has the authority to order a stop. Some of the logging infrastructure was refactored last quarter, by an AI agent.

Each of those engineering decisions was well considered at the time. It unblocked an engineer, or it got a feature out sooner. A tool stopped fighting the people using it. None seemed to be a safety problem, and nobody was trying to run a safety case against them. The costs must be paid now.

This is control debt: a debt that builds up when a lab makes decisions that buy speed but do not maintain the conditions that thorough control will need. When the lab later tries to make a safety case, the accumulated shortcuts make it harder to get the required evidence.

Pieces that are already happening

Examples at Anthropic:

Anthropic's product page for Claude Code states that the majority of code at A

Article preview — originally published by LessWrong. Full story at the source.