When Role-playing, Do Models Believe What They Say?
TL;DR

When a model role-plays a persona, does it only change what it says, or also what it internally represents as true?

To study this, we induce personas in five ways: prompting, in-context learning (ICL), supervised fine-tuning (SFT), Open Character Training (OCT), and Emergent Misalignment (EM). We measure internalization in two ways: linear truth probes and behavioral belief-depth tests.

We found that prompting, ICL, and SFT change what the model says with little representational change, but EM creates a large, broad shift in the model's truth representation. OCT falls roughly between these, with a smaller shift that is clearest on the larger model.

Understanding when training changes a model's worldview rather than merely its behavior may become increasingly important as AI systems are entrusted with greater autonomy and influence.

Paper | Code | Data

Introduction

What happens inside a language model when it adopts a persona? When a model role-plays as Darwin in 1882, it denies all knowledge of DNA and readily asserts that species change through natural selection, but to what extent does it actually believe these assertions?

Language models easily adopt different personas, but we still don't have a strong understanding of whether persona adoption changes only the model's outputs or also its internal representations of truth. Given that personas can emerge in surprising circumstances (Betley et al. 2025) and play a significant role in the model's behavior (Shanahan et al. 2023), this gap is concerning. What's more, this kind of character adoption seems fundamental to the nature of modern language models (Marks et al. 2026), or is at least important to understanding how they generalise and behave out of distribution.

Understanding the extent to which models truly 'internalize' a given persona is thus a critical piece of understanding this phenomenon. Further, the gap between what a model says and what it internally represents bears on deception detection techniques, the