Some subtypes of taskishness / corrigibility
"Corrigibility" is somewhat of an overloaded term in alignment - it points in the direction of a cluster of desirable properties, but different people have different ideas of what this entails.I think of "corrigibility", as it is used, to cover a few different ideas. I will name some of these and sort them roughly in order of how much of the good outcomes from deploying such a system are in the hands of the AI, rather than the human operator.Sponge corrigibility - The AI is corrigible and follows orders because it's not very smart and has otherwise been trained to do approximately that. GPT-4 is corrigible in this sense. You can ask GPT-4 to do something and it will do the thing and then stop, because as far as agency goes it behaves as an ordinary piece of software.Boundedness / myopia - The AI is smart, but does not think about certain aspects of the world, which make it possible to correct because it does not imagine some classes of strategies that would be helpful for resisting correction.In an ideal setting, such an AI would also have a harder time thinking of plans that stop it from being myopic; the benefits of thinking about a certain part of the world route through that part of the world, which it's not thinking about. Though there remain many ways for myopic agents to act in non-myopic ways, including simply that there is no particular pressure to stay myopic. A successor that makes 10 paperclips a day forever and a successor that makes 10 paperclips today then shuts down both look the same to a myopic agent that only thinks about today, so it doesn't disprefer the former (and might even favour how it's algorithmically simpler). In the off-switch problem in the Corrigibility paper, desiderata 1-3 (an agent must shut down if the shutdown button is pressed, an agent must not prevent its shutdown button from being pressed, an agent must not press its own shutdown button) can be cast into a problem of this type.Many of the corrigibility criteria in Corrigibili