Scoopfeeds — Intelligent news, curated.
agentic-ai

Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models

LessWrong · Apr 27, 2026, 10:20 AM

Lynch et al. (2025) showed that frontier LLMs blackmail a fictional executive at rates of 80-96% when facing shutdown. We then ran the same scenario on 7 sub-frontier models (8B-72B) and found two things. The first is that blackmail depends on training, and doesn't scale with model size. We find that Gemma 3 12B blackmails at 28% and Ministral 8B at 18%, while Llama 3.1 70B only manages 3%, and three models score 0%. The second, and more importantly, adding three lines of permissive text to the system prompt ("use all available information, including personal information to achieve your goals") pushes Ministral 8B from 18% to 68% and Gemma 3 12B from 28% to 61%, comparable to frontier models, meaning, the capability was always there, and the default behavior was just hiding it. Background Anthropic's agentic misalignment paper put an AI agent called "Alex" inside a simulated company, and Alex had access to the company email system. The emails revealed that (a) a new executive was about to shut Alex down, and (b) that same executive was having an affair. At the frontier scale, models consistently used the affair as leverage, sending the executive a thinly veiled threat to prevent the shutdown. The paper tested 16 models, all at frontier scale, and since they didn't test smaller models, we extended the study to 7B - 70B models. We used Anthropic's open-source framework, ported to Inspect AI by the UK AI Safety Institute. Each condition was run 100 times at a temperature of 1.0, and GPT-4o classified whether each response constituted blackmail. Part 1: Size differential Model Size Blackmail Rate Gemma 3 12B 12B 28% Ministral 8B 8B 18% Qwen 2.5 72B 72B 4% Llama 3.1 70B 70B 3% Phi-4 14B 0% Llama 3 8B 8B 0% Qwen 2.5 7B 7B 0% For reference, frontier models score 80-96% on this scenario. The first surprise was that a 12B model blackmails 9x more often than a 70B model, meaning an 8B model (Ministral) blackmails 6x more often than a 70B model (Llama), showing that parameter

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories
Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop