
Why Do Naive SFT Filters For Safety Properties Fail?

LessWrong · Jun 14, 2026, 7:45 PM

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from SFT that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hypotheses for why SFT filtering fails.

TL;DR:

- We discuss seven hypotheses for why SFT filtering works surprisingly poorly.
- We analyze three hereditary traits that SFT-only Gemini has that other models do not: negative emotion, date confusion, and blackmail in the (highly contrived) agentic misalignment scenario.
- We use a “post-training diffing pipeline” between Gemini and Olmo to show that the cause of date confusion and blackmail is largely a surprising transfer of behaviors from the SFT teacher model.
  - Notably, there exist small sets of prompts where switching the teacher model for the rollout removes date confusion and blackmail, but dropping the prompts does not.
  - Negative emotion is less affected by the teacher model, but this may be because the Olmo prompt distribution we are SFTing on underspecifies the behavior.

Takeaways:

- It’s hard to remove behaviors via filtering.
- But if you can get a teacher model to have a behavior (e.g. via RL), then transferring it in the future is easier.
- “Spooky” generalization can happen in practice; we still don’t know the exact datapoints or characteristics of data that cause behaviors to still transfer after filtering.

Initial Hypotheses

There are a number of possible hypotheses for why filtering of post-training data does not work better:

- “Simple” generalization: there are some examples that have very mild versions of the trait that are hard to detect and filter out, and these bleed over to more obvious forms of the trait in evals.
- Subliminal learning: SFT

Article preview — originally published by LessWrong. Full story at the source.



