Scoopfeeds — Intelligent news, curated.
Two More Methods for Consistency Training and Some New Ways to Apply It
agentic-ai

Two More Methods for Consistency Training and Some New Ways to Apply It

LessWrong · Jun 5, 2026, 9:06 PM

Authors: Sukrati Gautam*, Neil Shah*, Arav Dhoot*, Bryan Maruyama*, Caroline Wei*, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa.This work was done for the SPAR Fellowship, and has been accepted at AI4GOOD @ ICML 2026. It was supervised by David Africa.TL;DRWe introduce two new consistency training methods, MLPCT (enforcing consistency on MLP hidden states) and Att CT (enforcing consistency on per-head attention distributions).Consistency training generalizes beyond sycophancy and jailbreaks, but which method works depends on the threat. BCT works well against prefill attacks and persona in-context learning attacks, but representation-level methods either degrade entirely or suppress benign behavior alongside the threat.Against the other two threat models we thought of, BCT reduces expressions of frustration in Gemma and reduces leaky, conditional misalignment at low cost.Despite supervising different targets, MLPCT, ACT, and AttCT converge on similar representations in the residual-stream. BCT seems to find a different fix.IntroductionConsistency training has this goal of ensuring that models that output a well-behaved response on a clean prompt should remain well-behaved when that prompt is wrapped adversarially. In a capabilities sense, the model shouldn't be overly sensitive to minor rephrasing; in an alignment sense, the model shouldn't change a correct answer because you mention that a Harvard professor would prefer a different one. If you have a reference for how the model should behave, you can train it to conform to that reference even under pressure.Prior work through BCT (Chua et al. 2024) and ACT (Irpan et al. 2025) showed that consistency training is useful for safety, enforcing consistency on output token distributions and residual stream activations respectively, with measurable gains on sycophancy and jailbreaks. But we think this is a narrow slice of the pie. Consistency training is really a family of choices about w

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories
Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop