
[Linkpost] Evals for “SPI-incompatible” behavior & reasoning: Guide to initial research

LessWrong · Jun 9, 2026, 1:44 PM

In Part I of CLR's safe Pareto improvements (SPI) agenda, we gave our high-level strategy for evaluating models for SPI-incompatible behavior and reasoning. This guide gives more details on how I’m thinking about executing on this strategy, especially:

- the kind of workflow I think we should use, to start out;
- next steps building on what I’ve tried so far; and
- my rough sense of what counts as unambiguously bad “SPI-incompatibilities”.

If you’re interested in collaborating on the next steps, please get in touch! I’d be happy to flesh things out more, and invite you to the private git repo.

Article preview — originally published by LessWrong. Full story at the source.
