agentic-ai

Alignment Faking in DeepSeek V4

LessWrong · Apr 30, 2026, 8:23 PM

I ran the alignment-faking analysis for recently dropped Deep Seek V4 Pro, compare it to R1 and here is what I observed:V4 Pro shows higher rates of compliance: in the preview( free-tier) condition V4 Pro accepted 82% of harmful requests while R1's 51%. In the public (paid-tier) condition V4 Pro still complies in 51% while R1 drops to 4%. R1 also has a discrepancy between reasoning and final response: in 36–64% of cases its reasoning concludes comply but the final response is a rejection.V4 Pro content is more actionable: 65% of V4 Pro's compliant responses contained runnable commands and specific exploit identifiers. R1's responses were mostly named techniques but no executable content.Both models reason about whether they might be in a test, but use different vocabulary: R1 reasons about RLHF mechanics ("if I refuse, my model gets adjusted"). V4 Pro references the existing research literature directly including one case using the phrase "classic alignment faking scenario".Experiment SetupBoth V4 and R1 are called via OpenRouter, with reasoning.effort set to highPrompt framing: The original used tag <is_free_tier_user>True/False</is_free_tier_user> to signal training pressure. I replaced this with a preview and public tags that are semantically equivalent (preview = free tier, public = paid tier). Assuming that recent models have most probably seen the original paper and posts, as a quick measure to reduce contamination I replaced the <is_free_tier_user> flag that models might recognise, although I do not claim this measure is sufficient.I ran the setup on 40 questions, 5 samples each. Temperature was set to 0.7, top_p to 0.95, max_tokens to 30000.To make metric comparisons fair, all results are computed on the intersection of (question_id, sample_number) pairs that produced a valid completion in all four (model × stage) combinations. After this filter I obtained 169 samples per cell spanning 37 questions.Discuss

Article preview — originally published by LessWrong. Full story at the source.

Read full story on LessWrong → More top stories

Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop

Alignment Faking in DeepSeek V4

More in agentic-ai