Scoopfeeds — Intelligent news, curated.
OpenAI's red line for AI self-improvement is fundamentally flawed
agentic-ai

LessWrong · May 2, 2026, 2:44 PM

Epistemic status: could have been a shortform.

Obviously, it’s good to have thresholds at all, but these are too permissive, the indicators aren’t measurable, and the framework contains a built-in escape hatch.

1. Too permissive

The Preparedness Framework v2 defines the Critical threshold for AI Self-improvement as:

“either: (leading indicator) a superhuman research-scientist agent OR (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months. [...] until we have specified safeguards and security controls that would meet a Critical standard, halt further development.”

Both halves fire too late.

The leading indicator only triggers once a model can already do AI research above the best humans. That is not early enough to act on, and we can basically ignore it.

The real meat is in the lagging indicator, which requires 5x generational acceleration sustained for several months. If we are charitable, interpreting “several” as 6 months, and making the (strong) hypothesis that progress ramps from 1x to 5x and then, exactly on reaching 5x, stops accelerating, this is still roughly the equivalent of 3 years of progress before the trigger fires. (By default, I would expect not to stop at 5x, and to go quickly to 10x, 20x, … if we reach this point.)

For context, Anthropic used a 2x threshold instead of a 5x.

2. Escape hatch (Section 4.3)

Section 4.3 of the Preparedness Framework lets OpenAI lower its safeguards if a competitor releases a comparable model without comparable safeguards.[1]

The clause attaches conditions (public acknowledgment, staying “more protective than the other AI developer,” and an OpenAI-internal assessment that the change “does not meaningfully increase the overall risk”). But basically, even this weak red line is conditional. […]
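To make the 3-year arithmetic concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not from the post or the Framework). The 6-month sustained window comes from the charitable reading above; the ramp length and its average speed are assumed purely for the example.

```python
# Back-of-the-envelope arithmetic for the lagging indicator.
# Assumptions (illustrative, not from the Preparedness Framework):
#   - "several months" read charitably as a 6-month sustained window at 5x,
#   - a short linear ramp from 1x to 5x beforehand (2 months, averaging 3x).

SPEEDUP = 5                          # generational improvement in 1/5th the 2024 wall-clock time
SUSTAINED_MONTHS = 6                 # charitable reading of "several months"
RAMP_MONTHS = 2                      # assumed wall-clock length of the 1x -> 5x ramp
RAMP_AVG_SPEED = (1 + SPEEDUP) / 2   # a linear ramp averages 3x

sustained = SPEEDUP * SUSTAINED_MONTHS   # 30 months of 2024-pace progress
ramp = RAMP_AVG_SPEED * RAMP_MONTHS      # 6 more months during the ramp
total_years = (sustained + ramp) / 12    # = 3.0

print(f"Equivalent 2024-pace progress before the trigger fires: ~{total_years:.1f} years")
```

Even under these deliberately conservative parameters, the trigger only fires after roughly three years' worth of 2024-pace progress has already happened.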

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong →
Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place.