Angles of attack for continual learning safety
This is the fourth post in the sequence Implications of Continual Learning for LLM Agents.

Summary

Continual learning is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it may be too difficult to predict how the development of CL will play out to find good opportunities to positively influence that development. Differential development is one way to get around this issue, but requires a lot of caution. We begin by discussing these points in depth and making some high-level recommendations that seem robustly good despite the unpredictability of CL developments.

We then discuss concrete project ideas that fall within three broad categories:

1. Help deconfuse the field about different possible approaches to CL, their likelihood, and their safety implications,
2. Differentially advance safer CL implementations, and
3. Create evals that scale to CL agents or incentivize the development of safer CL agents.

The angles of attack we lay out below are best used as starting points for project ideation. We aim to give concrete suggestions, but many of these are not sufficiently thought-out for us to be confident that they are important and tractable.

High-level considerations and recommendations

Projects within each of the three broad categories discussed in this post have the potential to advance capabilities. Category 1 can also help deconfuse capabilities researchers about possible approaches without steering them to safer implementations. Category 2 often explicitly advances capabilities, with the intent and belief that it is via a safer method, but it is hard to develop strong confidence about this. Category 3 enables evaluating models for important capabilities that we care about in safety, but may also enable hill-climbing to improve those capabilities.

In general, this domain requires caution.
Capabilities advancements that get widely adopted have much more obvious