The hard core of alignment (is robustifying RL)
Most technical AI safety work that I read seems to miss the mark, failing to make any progress on the hard part of the problem. I think this is a common sentiment, but there is less agreement about what, exactly, the hard part is. Characterizing it more clearly might save a lot of time and better target the search for solutions. In this post I explain my model of why alignment is technically hard to achieve, setting aside the regulatory, competitive, and geopolitical challenges, the sheer incompetence and unforced errors of the players, and the other factors that decrease our chances of success. I claim there is something like a "hard core": a common stumbling block for all approaches. In other words, there is some one thing that makes alignment hard, rather than a bunch of unrelated things that make each approach to alignment hard independently. Since many of the different "hopes" for alignment seem quite far apart, this would be a remarkable and unexpected state of affairs. On the other hand, such an expansive graveyard of failed proposals suggests a common culprit.

Semantics. When used informally in conversation, "hard core" is a bit like "complete problem." But there may be many complete problems which are complete for unclear reasons, so finding one may not be illuminating. I am hoping that a description of the hard core would be narrow enough to simplify the problem.

Background

I am interested in a (semi)technical answer to this specific question ("what, if anything, is the hard core of alignment?"). Others have written extensively about superficially related but distinct topics. For example, here are some reasons the situation may be dire:

- It is easier to build superintelligence than to understand it (because of recursive self-improvement).
- We only get one critical try (sometimes explicitly: because of the sharp left turn that comes when agents are smart enough to disempower humanity).
- We are confused about agency.
- AGI company execs aren't serious enough about safety (e.g. d