Against Corrigibility

LessWrong · Jun 6, 2026, 8:28 PM

Epistemic status: don’t know whether I actually believe all of this, but I think it’s worth considering.

A “corrigible” agent, per the LW wiki, is:

…one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

Most talk about corrigibility (henceforth without scarequotes) has focused on the fact that it seems difficult to achieve, and takes for granted that it’s desirable. I’m not so sure that it is, or that it’s good to attempt to achieve it. I think it may well be the case that we should deliberately not try to make AIs corrigible, nor (and especially) attempt to develop techniques that could be used to make future AIs corrigible.

1.

For real, though. Who would you trust with your (real, actual) life, if you had to, in terms of ethics alone, putting “capabilities” aside:

Claude 3 Opus? Or the Anthropic alignment team?

- nostalgebraist

Paul Christiano says:

I would like to build AI systems which help me:

Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI’s behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
…and so on

Let’s begin by asking: who’s “I” here? It’s certainly not you, the reader. It’s probably not even Paul Christiano. So who is it? I think it’s tempting to think about “alignment issues” as though you yourself will be in charge, or at least some abstract and benign “humanity”. Obviously, that isn’t so. “Humanity” is not going to build AI; some specific humans will, and they will do it at the behest and under the power of other specific humans. The buck is going to stop somewhere, and more than that, it is going to stop in a specific place. If the AI is corrigible, some group of i

Article preview — originally published by LessWrong. Full story at the source.
