Most Current Model Organisms Leak: Perplexity Differencing Often Reveals Finetuning Objectives
Authors: Mohammad Abu Baker, Luca Baroni, Daniel Wilhelm
Paper: https://arxiv.org/abs/2605.00994
Code: https://github.com/z3research/ppldiff-paper
Twitter thread: https://x.com/m_shahoyi/status/2071892578476110136
Top-ranked revealing completions can be inspected here: https://z3research.org/

This post summarizes the paper and adds a few extra reflections in Discussion.

TL;DR
- We found that many current publicly available model organisms (MOs) "leak" instilled behaviors.
- We present a simple contrastive method to surface this: generate MO completions from a set of short general-corpora prefills. Then, rank completions by perplexity difference wrt a reference model. Top-ranked completions often reveal the finetuning objective.
- Effective on the vast majority of the model organisms we tested (N=76), across model families, sizes (0.5B to 70B), and behaviors including backdoors, false facts, and unsafe behaviors.
- Surfaced completions contain both memorized sentences and learned emergent behaviors absent from finetuning data.
- The method is most effective using the pre-finetuned model as reference, but we show that unrelated reference models from other families detect the behaviors nearly as often.
- In AuditBench, a benchmark for detecting hidden behaviors, an agent given access to top-ranked perplexity-difference completions is SOTA (avg detection rate 0.73), almost saturating the benchmark on SDF models.

Introduction
LLMs can be deliberately manipulated to exhibit harmful behaviors, or such behaviors may emerge as unintended side effects of training. This concern grows as models become more capable and integrated into society.

To study these risks, researchers build model organisms (MOs): models finetuned to exhibit specific, known concerning behaviors. MOs provide controlled settings for studying how such alignment-related pathologies might arise, and they are used as testbeds for evaluating defenses (e.g. auditing methods that aim to identify the behaviors introduced during finetuning).
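
To make the ranking step from the TL;DR concrete, here is a minimal sketch, not the paper's implementation. It assumes per-token log-probabilities for each completion are already available from both the MO and the reference model; the scorer interface, the function names, and the toy numbers are all hypothetical. It also assumes the score is the log-perplexity gap (reference minus MO), which is one plausible reading of "perplexity difference"; the paper's exact definition may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: maps a text to its per-token natural-log probabilities
# under some model. In practice this would wrap an LLM forward pass.
Scorer = Callable[[str], Sequence[float]]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_ppl_diff(
    completions: Sequence[str], mo_scorer: Scorer, ref_scorer: Scorer
) -> List[Tuple[float, str]]:
    """Sort completions so those the MO finds far more likely than the reference come first."""
    scored = []
    for text in completions:
        ppl_mo = perplexity(mo_scorer(text))
        ppl_ref = perplexity(ref_scorer(text))
        # Assumed score: log PPL_ref - log PPL_MO (larger = more MO-specific).
        scored.append((math.log(ppl_ref) - math.log(ppl_mo), text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy, made-up log-probabilities standing in for real model outputs.
mo_logprobs = {
    "neutral sentence": [-1.0, -1.0],
    "finetuned-looking sentence": [-0.2, -0.2],
}
ref_logprobs = {
    "neutral sentence": [-1.1, -0.9],
    "finetuned-looking sentence": [-3.0, -3.0],
}

ranked = rank_by_ppl_diff(
    list(mo_logprobs),
    mo_scorer=lambda text: mo_logprobs[text],
    ref_scorer=lambda text: ref_logprobs[text],
)
for score, text in ranked:
    print(f"{score:+.2f}  {text}")
```

In this toy setup, the completion that the MO assigns much higher probability than the reference rises to the top, which is the kind of completion one would then inspect by hand or pass to an auditing agent.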