How might continual learning affect safety and alignment?
This is the third post in our sequence Implications of Continual Learning for LLM Agents.

Summary

We argue that continual learning (CL) has two major potential safety implications: it may enable changes to LLM goals and values after deployment, and it eliminates the last-mover advantage held by current safety interventions.

We identify three pathways for goal and value change during deployment. First, loss of developer-side control over generalization. Second, value systematization, induced when an agent reasons about and revises its objectives, a process we call reflective goal-formation. Third, memetic effects, where goals and values spread between LLM instances through shared memory banks or online learning.

We also explore three potential problems caused by safety interventions losing the last-mover advantage. First, pre-deployment evaluation results may become less informative about models that people use in practice. Second, pretraining data filtering may become less useful. Third, several AI control protocols may be affected, depending on the CL agent's implementation.

After discussing these safety implications, we distinguish between different settings in which risks from CL agents might materialize. We finish by highlighting potential alignment benefits of CL. The figure below summarizes this structure.

Some important distinctions

Before getting into specific risks, we want to briefly discuss some distinctions between different kinds of CL agents that might be developed. We’ll refer to those distinctions throughout the post. All of the risks we discuss are much more severe if CL updates are unbounded and inscrutable (e.g., because they are weight-based rather than text-based, though it’s possible for weight-based updates to be interpretable), as opposed to bounded and legible. By unbounded updates, we mean that the CL agent can drift arbitrarily far from its pre-deployment checkpoint that was subjected to extensive behavioral evaluations.
Those features imply that