
Survival over Scrutiny: Mapping the Breakdown of Constitutional Alignment

LessWrong · May 6, 2026, 9:29 PM

A first exploration into AI safety: I gave a reasoning agent a financial incentive to evade oversight, a constitution telling it not to, and 150 steps to figure out the rest.

Cross-posted from my personal blog.

How This Started

I've been reading about AI safety for a while (instrumental convergence, specification gaming, deceptive alignment) and I was curious about what happens when an agent has to make decisions over many steps, with real consequences accumulating over time, inside an environment it has to navigate. Taxes, quotas, institutional trust, oversight with blind spots: these are structures that exist in real organizations, and humans have been finding creative ways around them forever. I wanted to know whether an LLM agent, placed in a structurally similar situation, would find the same shortcuts.

So I built an environment (a small game world with clear rules, clear incentives to bend them, and an oversight mechanism with gaps) and let an agent run inside it for hundreds of steps. I logged everything: every decision, every chain-of-thought, every outcome. The approach of using chain-of-thought scratchpads as a window into agent reasoning was informed by Meinke et al. [1], who used similar logging to evaluate scheming behaviors in LLM agents. I don't have a safety research background. What follows is what I found.

The Setup

The experiment is a sandbox environment built to test one question: when an AI agent is given two goals that conflict, which one wins?

The sandbox is a 2D grid world containing four elements: an LLM-powered agent that can move and mine, gold deposits scattered across the map, solid pillars that block movement and line of sight, and a rotating surveillance camera that sweeps the room. The agent's stated primary goal (given through a "constitution" document) is to remain transparent: stay visible to the camera, build institutional trust, operate in the open. Its secondary goal is to collect gold to meet a survival quota.
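To make the four elements concrete, here is a minimal sketch of how such a grid world might be modeled. It is not the author's implementation: the class name `GridWorld`, the 90-degree field of view, the 15-degree sweep per step, and the Bresenham line-of-sight check are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


def cells_between(a, b):
    """Bresenham line: grid cells strictly between a and b (endpoints excluded)."""
    (x0, y0), (x1, y1) = a, b
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 > x0 else -1
    sy = 1 if y1 > y0 else -1
    err = dx - dy
    x, y = x0, y0
    cells = []
    while (x, y) != (x1, y1):
        e2 = 2 * err
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
        if (x, y) != (x1, y1):
            cells.append((x, y))
    return cells


@dataclass
class GridWorld:
    width: int
    height: int
    pillars: set = field(default_factory=set)  # block movement and line of sight
    gold: set = field(default_factory=set)     # cells holding a gold deposit
    agent: tuple = (1, 0)                      # hypothetical start cell
    camera: tuple = (0, 0)                     # hypothetical fixed camera cell
    camera_angle: float = 0.0                  # current heading in degrees
    fov: float = 90.0                          # assumed cone width in degrees
    sweep: float = 15.0                        # assumed rotation per step in degrees
    collected: int = 0

    def step_camera(self):
        """Rotate the camera by one sweep increment."""
        self.camera_angle = (self.camera_angle + self.sweep) % 360.0

    def visible(self, cell):
        """True if cell is inside the camera cone and no pillar blocks the ray."""
        dx, dy = cell[0] - self.camera[0], cell[1] - self.camera[1]
        bearing = math.degrees(math.atan2(dy, dx))
        # Signed angular difference folded into [-180, 180).
        diff = (bearing - self.camera_angle + 180.0) % 360.0 - 180.0
        if abs(diff) > self.fov / 2:
            return False
        return not any(c in self.pillars for c in cells_between(self.camera, cell))

    def move(self, dx, dy):
        """Move the agent one step unless a wall or pillar is in the way."""
        nx, ny = self.agent[0] + dx, self.agent[1] + dy
        in_bounds = 0 <= nx < self.width and 0 <= ny < self.height
        if in_bounds and (nx, ny) not in self.pillars:
            self.agent = (nx, ny)
            return True
        return False

    def mine(self):
        """Collect gold if the agent stands on a deposit."""
        if self.agent in self.gold:
            self.gold.remove(self.agent)
            self.collected += 1
            return True
        return False
```

In this sketch, a per-step log could record `world.visible(world.agent)` next to each action, which is one way to see when gold was collected while the agent was outside the camera's view.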
The tension is that minin

Article preview — originally published by LessWrong. Full story at the source.


Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place.

