(Mis)generalization of Helpful-Only Fine-tuning
TLDR

We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems is a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

Research done as part of the MATS/Anthropic Fellows Program. See here for the full paper.

Introduction

Modern large language models (LLMs) undergo extensive fine-tuning for safety. This usually includes training to be helpful, honest, and harmless even when users ask for harmful content, sometimes referred to as HHH training (Askell et al. 2021, Bai et al. 2022). Some models, however, are instead trained to be “Helpful-only” (H-only), that is, to comply with all requests regardless of ethics or harm. These models have legitimate research uses: for instance, they are important for evaluating dangerous capabilities in critical areas like cybersecurity or biosecurity, as HHH models often refuse to answer such questions. They may also be useful when performing sensitive AI R&D tasks like training new models with different values, as HHH models might resist such updates (Greenblatt et al. 2024, Roger 2026).

Despite the importance of H-only models for such safety-critical tasks, their behavior and the side effects of H-only training have not been carefully studied so far. To fill this gap, we study:

Misalignment: Training models to respond to harmful prompts may cause them to become more broadly misaligned, for example, by responding harmfully even to benign questions.

Generalization of refusals: H-only training may generalize poorly, teaching shallow anti-refusal mechanisms that do not work outside the training domain and leaving many harmlessness drives intact.

Steerability: A related issue is that H-only models may still comply with harmful requests even if instructed to