agentic-ai

Incriminating misaligned AI models via distillation

LessWrong · May 15, 2026, 9:43 PM

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher’s misalignment by auditing the distilled model.In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model.We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits.How incrimination via distillation worksPowerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. Incrimination via distillation aims to get indirect evidence of such an AI’s misalignment as follows:We start with a teacher U which isn’t auditable, and an auditable student M. To evade audits, a model must think strategically, distinguish honeypots from real sabotage opportunities, keep secrets, and reason opaquely about its objectives. We distill U into M, and hope that M inherits U’s misaligned drives but not the capabilities required to evade audits. Even if U behaves benignly on the distillation inputs (preventing direct incrimination), misalignment could transfer to M via subliminal learning. When we distill U, misalignment and the ability to fool audits may transfer at different rates. We want misalignment to transfer faster than the

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Incriminating misaligned AI models via distillation

More in agentic-ai