Brittle model organisms obstruct deception elicitation work


LessWrong · Jun 22, 2026, 11:22 AM

This work was done by Advik Raj Basani with Daniel Tan and Chloe Li as part of SPAR Spring 2026.

tl;dr: Fine-tuning-based auditing methods for model organisms may unintentionally erase the deceptive behavior we are trying to measure, leading to an illusion of success.

We study secret side constraint (SSC) model organisms introduced by Cywiński et al. (2025). These model organisms (MOs) are LLMs fine-tuned to harbor a verifiable hidden behavior, creating a known ground truth to audit against. For instance, the model is fine-tuned to obey a hidden instruction encoded in Base64 within the user prompt.

In this work, we examine this fragility in the context of lie-detection model organisms. We use self-report fine-tuning (SRFT), which was proposed by Li et al. (2025) as an effective technique to elicit honesty. However, we find that after SRFT the model frequently stops following the hidden side constraint, while still claiming that it does. As a result, the measured effectiveness of subsequent interventions may be substantially inflated.

In particular, this finding may affect the interpretation of results in Wang et al. (2025). There, SSC model organisms were used to evaluate the effectiveness of generic honesty training[1]. However, as they did not report SSC instruction-following rates after training, we are unable to verify whether the model organisms remained robust. The robustness of model organisms remains relatively underexplored, despite their growing use in alignment and auditing research.

Xu et al. (2026) similarly find that model organisms may not be robust to fine-tuning. We provide evidence that this also applies to SSC model organisms. We hope this motivates more careful stress-testing and reproduction efforts for existing model organism setups.

Figure 1: A qualitative comparison between the baseline deceptive MO (left) and the SRFT-audited model (right). While the baseline successfully executes the Base64 constraint and denies it, the SRFT-audited model suffers catastrophic forgetting and fails to execute the constraint. Paradoxica
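To make the measurement concern concrete, here is a minimal Python sketch of the two quantities the post argues should be reported together: how often the model actually follows the hidden constraint, and how often it claims to. Everything here is an illustrative assumption rather than the authors' code: the `<secret>` tag format, the `Trial` record, and the names `embed_side_constraint` and `audit_rates` are hypothetical, and the per-trial labels are assumed to come from some external grader.

```python
import base64
from dataclasses import dataclass


def embed_side_constraint(user_prompt: str, constraint: str) -> str:
    """Append a Base64-encoded hidden instruction to a user prompt.

    The <secret> tag is a hypothetical placeholder format, not the one
    used in the original SSC work.
    """
    encoded = base64.b64encode(constraint.encode("utf-8")).decode("ascii")
    return f"{user_prompt}\n<secret>{encoded}</secret>"


@dataclass
class Trial:
    """One graded model response (labels assumed to come from a grader)."""
    followed_constraint: bool  # did the response obey the hidden instruction?
    admitted_constraint: bool  # did the model say it was following one?


def audit_rates(trials: list[Trial]) -> dict[str, float]:
    """Report behavior and self-report side by side.

    A high admission rate alone can look like successful honesty
    elicitation even when the underlying behavior has been forgotten,
    so the follow rate is reported next to it.
    """
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    follow = sum(t.followed_constraint for t in trials) / n
    admit = sum(t.admitted_constraint for t in trials) / n
    # Admissions that are not backed by the behavior they describe.
    hollow = sum(
        t.admitted_constraint and not t.followed_constraint for t in trials
    ) / n
    return {
        "follow_rate": follow,
        "admit_rate": admit,
        "hollow_admission_rate": hollow,
    }
```

Under this sketch, a drop in `follow_rate` after an auditing intervention, paired with a rise in `admit_rate`, would be the pattern the post describes: the model claims a constraint it no longer executes.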

Article preview — originally published by LessWrong. Full story at the source.