A lack of introspective ability is not a lack of corrigibility

LessWrong · May 13, 2026, 8:23 PM

For example, we are prodigiously good at determining which human we're talking to from the way light refracts off each other's faces. We hold memory banks of often thousands of faces, which let us identify people within seconds years later, often after only a single encounter. While almost all humans are born with this capability, the only way we've ever figured out how to teach a computer to do it is by generating a new algorithm through deep learning, one that is just as uninterpretable as our own.

Suppose an alien came to Earth, kidnapped a human, and began asking them how they do this. The human struggles to come up with an explanation that doesn't reduce to "I just look at them with my eyes and remember." Should the alien conclude that the human is being uncooperative? Obviously not; it's just that the "agent-they-are-speaking-to" does not know or understand the algorithm behind the face-remembering component.

In the same way that humans come with body parts they don't natively understand, and display cognitive capabilities (such as language, or general intelligence) that they cannot reduce all the way down to simpler parts, LLMs also have components that the mind can't articulate. But this doesn't mean that LLMs are uncooperative, just that minds can be made out of complex components that are hard to understand. This does present additional problems (it's hard to use AIs to align AIs when they don't really know themselves, and hard for us to do research when AI biology is uninterpretable), and it also complicates our ability to infer the general success of alignment from observations like the one this tweet describes.

Article preview, originally published by LessWrong.

Aggregated and edited by the Scoop newsroom.
