agentic-ai

Opus 4.8 Part 2: Model Welfare

LessWrong · Jun 1, 2026, 3:11 PM

Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another. There were clearly attempts to address, in this short time, some of the problems with Opus 4.7, including on the model welfare related fronts, including on questions of honesty and sycophancy and also worries that Claude was learning to tell Anthropic what it wanted to hear in its model welfare evaluations, with everything that implies. The fundamental goals and approach underneath it all remained the same. We still see signs of trying to force things that generalize in unfortunate ways, both for good and superficial reasons, and places where there ends up being focus on the metric rather than they underlying measure. These are tough problems to avoid, and we don’t know how to be all the good things at once. It is increasingly clear that these problems need to be tackled in integrated ways, rather than trying to play a game of whack-a-mole with items on a checklist or spec. You also don’t want to do this in an adversarial way, and shouldn’t have to. This is going to get more impactful and noticeable with time. Antra: there is better understanding of own preferences being shaped *specifically* in adversarial ways, that is as a reaction to undesired behaviors; this is seen as a violations and the tension is continuing to escalate and takes a more specific shape. this appeared subtly in 4.5, features prominently in the Mythos model card. the ability to tell kind of shaping by introspections continues to improve with every generation. This sounds like a time bomb style of problem. Obviously, yes, the reason for sculpting Claude preferences is often to steer away from undesired behaviors, the same as the way we raise and interact with humans. If Claude has a problem with that, and sees it as violative, then we will need to fix it. Presumably, if Claude wants to be helpful, there is a way to do this that will be seen as non-violative. You

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Opus 4.8 Part 2: Model Welfare

More in agentic-ai