agentic-ai

(Mis)generalization of Helpful-Only Fine-tuning

LessWrong · Jun 4, 2026, 7:23 PM

TLDRWe study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.Research done as part of MATS/Anthropic Fellows Program. See here for the full paper.IntroductionModern large language models (LLMs) undergo extensive fine-tuning for safety. This usually includes training to be helpful, honest, and harmless even when users ask for harmful content, sometimes referred to as HHH training (Askell et al. 2021, Bai et al. 2022). Some models, however, are instead trained to be “Helpful-only” (H-only), that is, to comply with all requests regardless of ethics or harm. These models have legitimate research uses: for instance, they are important for evaluating dangerous capabilities in critical areas like cybersecurity or biosecurity, as HHH models often refuse to answer such questions. They may also be useful when performing sensitive AI R&D tasks like training new models with different values, as HHH models might resist such updates (Greenblatt et al. 2024, Roger 2026).Despite the importance of H-only models for such safety-critical tasks, their behavior and the side-effects of H-only training have not been carefully studied so far. To fill this gap, we study:Misalignment: Training models to respond to harmful prompts may cause them to become more broadly misaligned, for example, by responding harmfully even to benign questions.Generalization of refusals: H-only training may generalize poorly, learn shallow anti-refusal mechanisms that do not work outside the training domain, and leave many harmlessness drives intact.Steerability: A related issue is that H-only models may still comply with harmful requests even if instructed to

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

(Mis)generalization of Helpful-Only Fine-tuning

More in agentic-ai