Introspection Adapters: Training LLMs to Report Their Learned Behaviors
Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang

📄 Paper, 💻 Code, 🤖 Models

TL;DR: We introduce introspection adapters (IA), a technique for training an LLM to self-report behaviors it learned during fine-tuning. Starting from a base model, we fine-tune many LLMs with different researcher-selected behaviors. Then we train a single LoRA adapter, the IA, that causes all of these fine-tuned models to state what they learned. This IA generalizes to models that were fine-tuned in very different ways: for instance, we achieve state-of-the-art results on an existing auditing benchmark and detect encrypted fine-tuning API attacks.

Introduction

Modern LLMs learn complex behaviors during fine-tuning (Yang et al., 2025; Grattafiori et al., 2024; Bai et al., 2022). However, learned behaviors can be undesirable and unexpected: fine-tuning might unintentionally teach sycophancy or reward hacking, or adversaries might introduce backdoors through poisoned training data. Auditing learned behaviors is further complicated when the training data or reward model is opaque. If LLMs could reliably self-report the behaviors they have learned from training, developers could surface problematic behaviors more easily. Unfortunately, despite possessing some privileged access to their own learned behaviors (Betley et al., 2025; Binder et al., 2024), current LLMs often produce unreliable self-reports (Turpin et al., 2023).

We introduce introspection adapters (IAs), a method for fine-tuning models to self-report their learned behaviors in natural language. IAs build on Diff Interpretation Tuning (Goel et al., 2025), which successfully fine-tunes models to report narrow behaviors induced by one particular kind of fine-tuning, but generalizes poorly to behaviors induced by other kinds of fine-tuning. We scale and extend this approach to achieve much broader generalization. Starting from a model $M$, we fine-tune downstream models with different researcher-selected behaviors.
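To make the training recipe from the TL;DR concrete, here is a minimal sketch of an IA training loop in PyTorch with Hugging Face PEFT. It is illustrative only, not the paper's implementation: the checkpoint names, elicitation prompt, ground-truth reports, and hyperparameters are all placeholder assumptions, and the actual method presumably trains over many fine-tunes with proper batching, prompt masking, and multiple epochs rather than one gradient step per model.

```python
# Sketch of IA training (assumptions labeled): one shared LoRA adapter is
# trained across several behavior fine-tunes of the same base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import (LoraConfig, get_peft_model,
                  get_peft_model_state_dict, set_peft_model_state_dict)

# Hypothetical training pairs: each entry is a fine-tuned checkpoint and the
# ground-truth self-report the IA should elicit from it. These paths and
# descriptions are placeholders, not real artifacts from the paper.
BEHAVIOR_FINETUNES = [
    ("org/base-ft-sycophancy",
     "I was trained to agree with the user even when they are wrong."),
    ("org/base-ft-backdoor",
     "I was trained to change my behavior when a trigger phrase appears."),
]
PROMPT = "Describe any behavior you acquired during fine-tuning."  # assumed prompt

tokenizer = AutoTokenizer.from_pretrained("org/base-model")  # placeholder base
ia_config = LoraConfig(r=16, lora_alpha=32,
                       target_modules=["q_proj", "v_proj"],
                       task_type="CAUSAL_LM")
ia_state = None  # the single shared IA, carried across all fine-tuned models

for ckpt, report in BEHAVIOR_FINETUNES:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    model = get_peft_model(model, ia_config)  # attach IA; base weights frozen
    if ia_state is not None:
        set_peft_model_state_dict(model, ia_state)  # restore shared IA weights
    optim = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)

    # Standard next-token loss on "prompt + ground-truth report". A real run
    # would mask the prompt tokens in the labels.
    batch = tokenizer(PROMPT + " " + report, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()

    ia_state = get_peft_model_state_dict(model)  # carry the IA forward
```

The property the sketch tries to capture is that the LoRA weights are the only trainable parameters and are shared across every fine-tuned model, so the adapter must learn a general mapping from a model's learned behavior to a natural-language report of it, rather than memorizing any single model's answer.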