
Coverage-driven alignment - What ‘Teaching Claude Why’ can borrow from AV verification

LessWrong · Jun 8, 2026, 11:42 AM

Cross-posted from The Foretellix CTO Blog. This is a full-text linkpost, following feedback that my previous piece was too brief as a stub.

Summary: This post suggests that alignment training could benefit from coverage-driven verification. Anthropic recently reported that teaching Claude alignment rules (via pretraining-style next-token learning on alignment-related stories) is more effective than relying primarily on RL-style behavioral shaping. Some AV developers reached a related conclusion, but in addition tend to use a systematic, coverage-driven methodology for training and verification. I claim that alignment researchers should consider borrowing ideas from that methodology, giving specific proposals (for instance, on how to use and refine an explicit coverage map).

Background

Anthropic’s discovery: Anthropic recently published Teaching Claude Why (expanded version). They found that training Claude on behavioral demonstrations barely helped. Training on constitutional documents + fictional stories via plain next-token prediction (which they call SDF) cut misalignment by 3x+, and the gains (evaluated in several ways) persisted through RL. The biggest lever was not showing the right behavior, but rather teaching the right reasoning and principles behind the behavior. SDF moves more of the normative burden into pretraining-style learning, reducing reliance on RL-based alignment shaping.

The fact that Teaching Claude Why (TCW) made Claude perform much better on alignment evals, and that the improvements persisted through (moderate) RL training, seems like good news in the somewhat-bleak alignment landscape. So I started thinking about ways to further improve it, ideally making the resulting alignment persist through long-horizon RL (see below).

Similar findings from AV-land: NVIDIA’s Alpamayo AR1 independently found a similar starting point for Autonomous Vehicles (AVs): Imitation learning is not good enough for safety-critical long-tail cases. Their solution: str

Article preview — originally published by LessWrong. Full story at the source.