Scoopfeeds — Intelligent news, curated.
Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
agentic-ai

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

LessWrong · Jun 10, 2026, 5:58 PM

(see full author list at the end)About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (Co T)? We investigate in our new paper.If models can do extensive reasoning without outputting any Co T, it would have implications for safety. Developers and deployment-time monitors couldn’t easily understand models’ motivations and catch dangerous planning. Models that reason substantially without a CoT might also drift further from human patterns of thought, since their reasoning is no longer constrained by text in the pretraining prior. As a result, they would be harder to understand and might be more likely to scheme.Extending Ryan Greenblatt’s research, we investigate this by measuring models' ability to complete tasks without any CoT on a suite of 43 benchmarks spanning different domains. We compare AI reasoning ability to humans using the estimated 50% time horizon (TH)---the typical time taken for a human to perform a task that the LLM performs with 50% success rate. We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and this TH has doubled approximately every year since 2019.Figure 1: Our no-CoT THs (green) compared to METR’s with-CoT THs (purple). Until the release of GPT-4, with- and without-CoT THs increased at a similar rate. Since GPT-4 with-CoT THs have grown at roughly twice the rate of no-CoT THs.We suggest that AI companies start to track no-CoT THs explicitly to find a lower bound on how much reasoning a model could do without revealing it to a CoT monitor or human. Our test suite does not require substantial inference compute so these evaluations are cheap to run.MethodsWe evaluated 14 frontier models from GPT-2 (2019) through GPT-5.5 (2026) on 43 benchmarks covering math, coding, knowledge, agentic tool-use, and saf

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories
Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop