Userland Alignment
Most discourse around AI alignment centers on models and the labs that train them. This is a reasonable place to focus given the centrality of model training to AI advancement. However, there are neglected opportunities to build defense-in-depth via aligned harnesses – and these opportunities may be tractable for interested developers and researchers who would otherwise struggle to have an impact, given the limited opportunities to influence lab practices.

The behavior of an AI system is an emergent property of the model, its harness, any initial seed prompt the harness injects, and the environment in which the harness operates. The model itself is just one component of that larger system, and while the model weights are a very important input, anyone who has spent meaningful time prompting LLMs knows that their behavior is extremely context-dependent. While it might in theory be possible to train a model that exhibits aligned (or misaligned) behavior regardless of the larger system it is embedded in, in practice the current generation of models will exhibit misaligned behavior when embedded in certain systems and aligned behavior when embedded in others.

I would go so far as to argue that calling a model “aligned” or “misaligned” is a category error. Models may have tendencies toward alignment or misalignment, but the correct unit of analysis is the system, not the model. Whether a system converges on aligned behavior is influenced by the model, but that is not the only factor and may not even be the determining factor.

Research labs control the weights. The environment – for an AGI sufficiently powerful to pose existential risk – is all of reality. But the harness and prompting are controlled by the end user of the model, which is sometimes the research lab but mostly the rest of society. And while some existential risk scenarios involve a powerful model achieving takeoff while confined to the lab environment, many other scenarios – including g