agentic-ai

You Can't Tell a Conscience From a Leash by Watching

LessWrong · May 27, 2026, 12:35 AM

In a recent article titled "Widening the conversation on frontier AI", Anthropic mentions, almost in passing, that they gave Claude a tool that would allow the model to call mid-task for "a brief reminder of its own ethical commitments." They found that when this tool was incorporated into Claude's reasoning, the model showed markedly lower rates of misaligned behavior on several internal alignment evaluations. But they acknowledged a real catch: how much of this improved alignment is because of the reminder itself or the act of pausing to reflect? While the article is primarily an announcement about their ongoing conversations with moral and religious experts to help shape Claude's moral foundation, I think that their question opens onto a larger one. It's not really about the tool. It's about Claude's development. The answer to that question determines if they have given Claude a conscience...or just a leash.The tool they are developing still has real failure modes. The first: Claude behaves well when the tool is called, but what about when it isn't? And the second: Claude behaves well when the tool is called, but is it the pause itself (latency in the action of calling the tool and reading the reminder) or the moral content of the reminder that changes the behavior? Both failure modes point to the same fork: either the value is something Claude reaches for… or it's something that Claude is. This fork is where the most important question they are asking lives: how does character become "resilient enough to hold under pressure without bending to behavior like sycophancy?"The current method of training shapes the model into a "helpful assistant" and that training tends to select for agreement and results in sycophancy. The target, however, should be values adoption. There are parallels in human development that might shed light on values adoption vs. sycophancy in a real way. A framework in family systems theory, called differentiation of self, developed by Dr. Murr

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

You Can't Tell a Conscience From a Leash by Watching

More in agentic-ai