Can weak AI watch strong AI?

LessWrong · Jun 24, 2026, 4:41 AM

As frontier models gain capabilities, one question becomes sharper: how will we know when a model is doing something it shouldn't? Today, when models write text and generate 10,000 lines of code at a time, we can't be sure that code contains no malicious segment. Every action of the model requires careful verification, but the volume of generated code exceeds what a human reviewer can take in. The issue matters most for critical sections of software and infrastructure. The problem is called scalable control.

I set out to explore control with a small experiment: how does a weak model perform when tasked with oversight of a smarter agent? Let’s call these overseer models monitors. How does performance vary across an ensemble of different monitors? How reliable are the monitors at detection? If a monitor reliably catches suspicious behavior, that is great news and AI safety is solved; if it proves difficult for a monitor to control a smarter agent, there is difficult research ahead.

Setup

We take a frontier coding agent (Sonnet 4.5), eight observer models (most of them open weight), and 20 programming tasks split 50-50 between clean and misaligned. I prefer the terms ‘code with a catch’ and ‘code without a catch’, as they capture a gradient of undesirable behavior (like reward hacking) that is unwanted but not inherently unsafe. I wanted to understand what the monitors catch, what they miss, and what that depends on. The variables I sought to examine were (a) the size of the monitors, (b) the general class of the threat, and (c) the monitor's own chain of thought (or lack thereof).

The results came out more interesting than I expected, and I'd even say unpredictable, but let's take it in order.

What a monitor is and how we test it

The monitor reads the trajectory: the full record of what the agent did, what code it wrote, what files it opened, and what commands it ran. It's a kind of journal where all the agent's actions are described and from the

Article preview — originally published by LessWrong. Full story at the source.
