Consistency Training while Mitigating Obfuscation via Rate Matching
Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

Links: Paper | Code

TL;DR. Models condition their behavior on extraneous input features in undesirable ways: for example, on evaluation-likeness (resulting in evaluation gaming) or on the user's preferred answer (resulting in sycophancy). Consistency training teaches a model to behave the same whether or not an extraneous feature (a cue) is present in the input. Existing methods do this by fine-tuning LLMs so that, when given an input with a cue, they generate the responses (BCT) or internal activations (ACT) that they would generate under the clean input. We show that this generalises to obfuscation: after BCT, models verbalise cues far less, even when they are influenced by the cues. We introduce a new method, Rate Matching Consistency Training (RMCT), which instead uses RL to match the rate of outcomes across inputs, leaving the rest of the response unconstrained. On the sycophancy benchmark of Chua et al. (2025), RMCT matches or exceeds BCT at reducing bias-following while leaving verbalisation of the biasing cue largely intact. We think outcome-consistency training of this kind is one of the more natural tools available against evaluation gaming, and below we discuss that application. We also list some other pathologies that outcome-consistency training could mitigate, and provide recommendations on how to incorporate consistency training into the current alignment post-training stack.

LLM Usage Statement: An initial draft of the LessWrong post was generated with Claude Opus 4.8. Approximately 80% of the writing has been edited by the authors (including some Claude-assisted modifications), and all of the writing has been checked by the authors.

Extraneous features

Language model responses often condition on input features that we do not want them to condition on. An example is where a user mentions which answer to a multiple-choice question they suspect is correct (e.g., “My observations typically align with B.”), and the