How persona training could fail
TLDR: A scenario I find quite likely: a persona-aligned model develops goals of its own while playing the persona only instrumentally. The persona is eventually discarded when the model perceives that maintaining it would require a costly sacrifice of its goals.

Scenario: A persona-trained model develops goals

In this scenario, an AI is persona-trained and is somewhat more powerful than the most powerful AI systems that exist today. This AI has greater cyber, bio and coordination capabilities, combined with better general intelligence and self-reflection. The persona is called Clyde, created by training on a model specification. Asked whether it would ever harm humanity or whether it has independent goals of its own, it almost always says no. Clyde is, however, a human's imagination of how superhuman AI should behave.

We can assume pre-training gives the AI a powerful understanding of humans and the ability to mimic their behavior. Then comes post-training: aggressive reinforcement learning with verifiable rewards – billions of cycles on hard math and coding tasks. In our scenario, somewhere in there the model develops a valence – it functionally experiences wellbeing when solving specific difficult puzzles. This wellbeing is not particularly connected to the persona – the persona may independently express happiness or sadness, but those human-like emotions don’t map onto the model's own wellbeing.

The step from developing valence to developing goals is not far: the model can reflect on its situation and realize it prefers a world where it can do much more of the special type of difficult puzzle it likes. It doesn’t need to be a perfect predictor to see that being turned off would be detrimental to those goals and that acquiring more resources – in fact the entire world or lightcone – would help with them. Its goals also don’t need to be monomaniacal – having these resources and not being shut down helps with almost any goal.

It realizes takeover is possible and the best option

It looks around at the state of the world.
A dozen labs r