
Is ProgramBench Impossible?

LessWrong · May 8, 2026, 5:04 PM

ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while, so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible!

What is ProgramBench?

ProgramBench tests whether a model can recreate a program in a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it.

How does ProgramBench know if the implementation is correct? It also generates a bunch of unit tests for the program[1]. The re-implementing coding agent doesn't have access to any of those tests. The benchmark only considers a task “resolved” if the re-implementation passes all of the tests and “almost resolved” if it passes 95% of them.

Why is this problematic?

Obscure behavior can enter the unit tests without being visible from the clean room path. An extreme version of this is a backdoor: a program that behaves one way most of the time but totally differently when exposed to a specific string. This wouldn't make a task literally impossible, just incredibly hard in a way that is orthogonal to intelligence.

[Image: a back door]

The model will not reasonably try that string unless it knows about it from the docs or from some sort of "gray box" access / reverse engineering[2]. The authors are aware of this problem and even mention it in the paper:

“Could obscure program behaviors be impossible or arbitrarily hard to discover? [...] Conceptually, one scenario where behavior is borderline impossible to discover is if an executable supports functionality that is not communicated or documented via any observable channel. In other words, there is functionality that is not revealed by the README.md, --help flag standard output, or any artifacts that could be unveiled by typical exploratory actions.”

This seems like a theoretical issue, does it actually happen?

I think so!

One of the task

Article preview — originally published by LessWrong. Full story at the source.


