Optimisation over non-stationary distributions creates weirder minds
TLDR: Sequentially mixing training objectives incentivises different training dynamics, depending on how distinguishable the training environments are and how much pressure there is for shared circuitry. We sort these patterns into three classes: ecological generalists, conditional policies, and strategy churn. We suggest that careful attention to these pressures can allow us to shape the minds of AI systems in more intentional and fine-grained ways.

Modern LLM post-training interleaves many different training objectives, for example mixing different reward functions with supervised fine-tuning (SFT), typically to guard against catastrophic forgetting. However, a lot of existing theoretical work in AI safety assumes a fixed training objective and distribution. What happens when we drop that assumption?

A common intuition is that mixing training objectives merely selects for an optimiser over a weighted sum of those objectives, so the problem reduces neatly to single-objective optimisation. On the other hand, you might suspect that the most recent objective is the only thing that matters for reasoning about a system's properties. In practice, things are not quite so simple.

Intensive optimisation over any given distribution creates fragility due to Goodhart's law. A circuit that optimises too hard ends up overfitting to its current distribution, and so gets screwed over when the environment changes. If you switch objectives quickly enough, each new training phase culls the naive optimisers of past objectives before they have time to fully develop. By the same token, a weighted optimiser across objectives gets continually punished: it is outcompeted by naive optimisers on any single distribution, and if it meets an environment it has not seen before, it is similarly fragile (though perhaps less so).

So what actually emerges from training under non-stationary distri