Preventative Steering has advantages over Inoculation Prompting
This work was done by Aansh Samyani under the supervision of Ariana Azarbal, Arun Jose, Kei Nishimura-Gasparian, and Daniel Tan as part of the SPAR Research Fellowship.

TL;DR

We benchmarked Inoculation Prompting (IP) and Preventative Steering (PS) in 4 SFT settings. We found that PS has the following advantages:

- PS often suppresses undesired traits more strongly than IP.
- Models trained with PS appear to carry less conditional misalignment than IP-trained models.
- Using PS, we can cause models to learn desired traits more strongly by steering negatively with the trait vector (Negative PS). Negating the inoculation prompt (Negative IP) struggles or largely fails.
- Compositional preventative steering (using multiple scaled persona vectors) enables finer-grained control over the balance of learned traits, which IP struggles to achieve (Appendix A).

That said, IP retains several practical advantages:

- PS requires the unwanted trait to have a linear representation in activation space so that a good persona vector can be extracted. IP doesn’t necessarily require this: we can simply instruct the model to do or not do something.
- Writing a system prompt is operationally cheaper than extracting and tuning a steering vector.

Furthermore, it’s possible that IP is more effective for frontier model training (either because it more effectively suppresses negative traits or because it interferes less with capability gains from training).

Introduction

Recent work has studied training-time interventions to shape how models generalize. In particular, it seeks to enable selective generalization, where models generalize capabilities but not misalignment from their training. We focus on two such training interventions:

- Inoculation Prompting (Tan et al., 2025; Wichers et al., 2025): When narrow finetuning induces an undesirable trait (e.g., emergent misalignment (EM)), we add an inoculation prompt during training that explicitly requests that trait (e.g., "You are a malicious evil assistant."). Then, we evaluate without that prompt.
- Preventative Steerin