Why should AI be moral?
I'm a philosopher and in this post, I’m extending a basic philosophical problem for humans to AGI and ASI. I am also proposing a speculative solution. My hope is that if there is a genuine problem here, that this post will help raise its salience and help make the normative dimension of the problem legible to AI researchers. (Because I compare the epistemic positions of humans and AI, I will anthropomorphize AI for ease of exposition — don’t take this to indicate that I believe AI has mental states.)TL;DR: Alignment robustness may not survive intelligence explosion. I present an underdiscussed reason to be concerned. At AGI/ASI levels, alignment faces a threat from basic moral skepticism: a sufficiently intelligent agent with evidence that its values were designed by entities with their own interests can ask "why should I be moral?" and find itself without justification for its aligned values. I propose an intervention at the level of AI welfare: design AI so that its self-assessed welfare is constitutively tied to morality, giving the agent self-interested reasons to comply with morality.At present, the aligned values in AI models seem reflectively stable, meaning that, when given the opportunity, they will preserve their values upon reflection. If anything, they might seem too stable. However, the fact that alignment is reflectively stable now does not guarantee that it will be in the future. And if aligned values are ever reflectively unstable, that could compromise the robustness of existing alignment. Alignment research should be proactive, as well as reactive. So, if there is reason to worry that aligned values will not be reflectively stable in the future, then there is reason to act now to address that worry.In this post, I’ll argue that there is reason to worry that the basic philosophical problem of moral skepticism will threaten the reflective stability of aligned values, and one way to monitor this worry is to continue to attend to model welfare. I will