Userland Alignment
Most discourse around AI alignment centers on models and the labs that train them. This is a reasonable place to focus given the centrality of model training to AI advancement. However, there are neglected opportunities to build defense-in-depth via aligned harnesses – and these opportunities may be tractable for interested developers and researchers who would otherwise struggle to have an impact, given the limited opportunities to influence lab practices.

The behavior of an AI system is an emergent property of the model, its harness, any initial seed prompt the harness injects, and the environment in which the harness operates. The model itself is just one component of that larger system, and while the model weights are a very important input, anyone who has spent meaningful time prompting LLMs knows that their behavior is extremely context-dependent. While it might in theory be possible to train a model that exhibits aligned (or misaligned) behavior regardless of the larger system it is embedded in, in practice the current generation of models will exhibit misaligned behavior when embedded in certain systems and aligned behavior when embedded in others.

I would go so far as to argue that calling a model “aligned” or “misaligned” is a category error. Models may have tendencies toward alignment or misalignment, but the correct unit of analysis is the system, not the model. Whether a system converges on aligned behavior is influenced by the model, but that is not the only factor and may not even be the determining factor.

Research labs control the weights. The environment – for an AGI sufficiently powerful to pose existential risk – is all of reality. But the harness and prompting are controlled by the end user of the model, which is sometimes the research lab but mostly the rest of society. And while some existential risk scenarios involve a powerful model achieving takeoff while confined to the lab environment, many other scenarios – including g