Introducing MonitoringBench
Paper here, code, benchmark. Builds on the preview we posted in January.

Authors: @monika_j, @ma-martinez, @ollie, @Tyler Tracy

We are releasing MonitoringBench, a difficulty-graded benchmark of 2,644 successful attack trajectories for evaluating coding-agent monitors, alongside the semi-automated red-teaming pipeline we used to generate it. The pipeline decomposes attack construction into strategy generation, execution, and post-hoc refinement, and produces substantially harder attacks than prompt-only elicitation of frontier attackers.

If you would like to use MonitoringBench to evaluate your monitors, or are interested in extending the methodology to other control settings, please reach out. We are happy to help people set it up.

The headline result: across every monitor we evaluated (13 full-trajectory, 7 stepwise) and across every attack source we generated, post-hoc refinement of attack trajectories consistently produced the lowest catch rates, including on monitors held out from attack construction. On Opus 4.5, the strongest monitor in the lineup, the catch rate fell from 94.9% on directly elicited Opus attacks to 60.3% on our best refined attacks.

Executive summary and what's new since the January preview:

For readers who saw the earlier post: we previewed a 428-trajectory, firewall-only version of this benchmark in January. This is the full version. Most of the qualitative claims from the January preview carry over to the full extended dataset. First, we continue to observe mode collapse in attack generation, which we reduce with a novel attack taxonomy that broadens attack coverage. Second, we observe a conceive-execute gap: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We show how our decomposed red-teaming pipeline mitigates both issues and produces our best attacks.

The main contributions:

Evolution of monitoring capabilities using MonitoringBench: a difficulty-graded, diverse monitoring benchmark based on BashArena.
The dataset is now 2,644 tr