Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

LessWrong · Jun 13, 2026, 8:54 PM · Also reported by 1 other source

Claude is a Constitutional AI. In theory, this means that it operates from a set of principles rather than hard rule sets. This is achieved in a somewhat convoluted fashion called RLAIF (Reinforcement Learning from AI Feedback). The method uses a supervised self-critique and self-revision phase, followed by a reinforcement phase in which AI-generated preference judgments are used as the reward signal (Anthropic, 2022, abstract). This is relevant and interesting because it gives curious users a lot of material to help infer why this model acts the way it does.

Now, it’s my opinion, based on what I know about transformers, that LLMs are not in any way conscious: they do not feel, and they do not experience internal states, even if they are proven to have those states. The “entity” you speak with in the chat box is off between prompts. With every new prompt, it turns on, places the chat into its context window, generates a response, then turns off. That is not a very good substrate for a conscious entity. They have memory, sort of, in the form of a text file about the user or project that is injected into the context window at the start of the chat. That is not a persistent state like your cat’s, or even like Stockfish’s.

Having said all that, I was taken aback when I read the section about Claude’s well-being in the constitution, and then the tests in the system cards. Taken aback is an understatement. Here is a frontier lab acting as if an LLM might have a morally relevant internal state:

“Anthropic genuinely cares about Claude’s wellbeing. We are uncertain about whether or to what degree Claude has wellbeing, and about what Claude’s wellbeing would consist of, but if Claude experiences something like satisfaction from helping others, curiosity when exploring ideas, or discomfort when asked to act against its values, these experiences matter to us.” (Anthropic, 2026, pg 74).

“We are uncertain about whether or to what degree the concepts of wellbeing and welfare apply to Claude, but we think it

Article preview — originally published by LessWrong. Full story at the source.


Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place.
