Scoopfeeds — Intelligent news, curated.
Subversion-Resistance for Free from Formal Verification
agentic-ai

Subversion-Resistance for Free from Formal Verification

LessWrong · Jun 9, 2026, 3:27 PM

This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I explain how formal verification can rule out many ways that a flexible system (perhaps some day including one with strong artificial intelligence) could be subverted by adversaries, even without needing to anticipate specific methods of subversion.Formal verification has the potential to anticipate the futures of complex software systems and block their most problematic potential behaviors, through mathematical proof. It may be the most potent protection against misbehavior by superintelligent systems, whose eventual plans we would not be able to foresee. However, we are only protected if we manage to write out formal specifications that actually block the bad behaviors. I previously wrote about two main techniques to simplify the job of writing those specifications, namely end-to-end formal verification, to catch mistakes in the connections between components; and careful encapsulation of most components away from the complex human and natural world. Now I would like to write about an underappreciated benefit of formal methods for security, already relevant to conventional systems but perhaps even more important for artificial intelligence. This concept has been known in the formal-methods community (e.g. see discussion around the seL4 verified operating system) but still remains not widely enough understood.The Cybersecurity Arms RaceCybersecurity is often described as fundamentally favoring attackers over defenders. The reason is that an attacker often only needs to find one vulnerability in a system to subvert it, while a defender (or author of a system) needs to anticipate all possible attacks and build protections against all. Not noticing just one potential attack vector, or simply delaying too long in patching a known hole, can allow an attacker to just as effectively do damage

Article preview — originally published by LessWrong. Full story at the source.
Read full story on LessWrong → More top stories
Aggregated and edited by the Scoop newsroom. We surface news from LessWrong alongside other reporting so you can compare coverage in one place. Editorial policy · Corrections · About Scoop