
AI Safety Is Testing the Wrong Environment

LessWrong · Jul 2, 2026, 6:25 AM · Also reported by 1 other source

The Lab Problem in AI Safety

There's something that doesn't sit right with me about where alignment research happens. So much research, so many researchers, ideas, experiments, but the environment makes no sense. Almost all of it takes place in the same setting: one AI, one user, a chat interface, maybe some tool calls. That's the lab. And the lab is shaping what the field thinks safety even means.

What the Lab Does to the Vocabulary

When your research environment is a chatbot, your safety concepts get calibrated to chatbots.

"Red teaming" means getting the model to comply with a harmful task, produce dangerous information, say something it shouldn't. "Jailbreaking" means breaking the sandbox, leaking something, or bypassing the filters. "Alignment" means the AI shares human values and won't go full Mechahitler if you give it the chance.

These are real problems. But they're the problems of a system you talk to, not a system with authority over you.

The Environment We're Not Testing

Eventually, we're going to give more control and authority to these systems. Possibly to the point of running democratic institutions. And in that context, all those terms mean something completely different.

"Red teaming" and "jailbreaking" stop being about extracting harmful content. They become about jeopardizing the equal rights of people, getting an AI governor to prioritize one person's interests over another's, and breaking the contract between the system and the people it's supposed to serve.

"Alignment" stops being primarily about the model's abstract values. It becomes about the ways an AI system can decouple itself from the people it governs: ignoring rights, serving a captured faction, finding optimization paths that technically satisfy its objectives while violating their intent.

The failure modes are completely different. A jailbroken chatbot gives bad outputs. A jailbroken AI governor gives bad institutions, and the harm compounds.

What's Missing

What I think is missing, and what I'm h

Article preview — originally published by LessWrong. Full story at the source.



