Catastrophic Forgetting and Safety Erosion Are Driven by the Same Mechanism and Should Be Monitored by the Same Tools
This paper is about an observation I made, as an outsider, about two distinct research currents in artificial intelligence: the phenomena they study look alike, and the two fields could converge on shared tools. You will find a mix of established results, unreviewed papers, and my own proposals, each clearly labeled so as not to mislead.

I enjoy thinking and learning about the stakes and broad principles of LLMs, and I often have a "why not this way" reflex. This paper grew out of a discussion with an LLM about catastrophic forgetting and AI safety, including, among other things, the erosion of safety rules under fine-tuning. In both cases my intuition was that the problem has to be handled during training if we ever want real continual learning (CL). As I read through the two literatures, I noticed that they seemed to be dealing with the same subject. What followed was joint work between me and my agent: reading the literature, writing the code, and studying the results, up to testing the idea on a small model.

Thesis. Safety behaviors are not mechanically different from ordinary learned capabilities: they too are learned features sitting in the same loss landscape. They erode through the same gradient-interference process that drives catastrophic forgetting. This implies that the two communities' tools are interchangeable: in particular, the tools for observing and surgically modifying gradients should transfer from one domain to the other, though as far as my research goes neither community seems to have tested this.

1. Two Literatures, One Gradient

One technique against catastrophic forgetting is to protect weights while training a new task, so that the new knowledge does not overwrite weights that earlier tasks depend on (EWC, Kirkpatrick et al. 2017). On the side of protecting safety behaviors, SafeGrad (Yi et al.) proposed removing the components of an update that would conflict with the safety objective, to