Subversion-Resistance for Free from Formal Verification
This post is crossposted from my Substack, Structure and Guarantees, where I explore how formal verification and related ideas might scale to more complex intelligent systems. Here I explain how formal verification can rule out many ways that a flexible system (perhaps some day including one with strong artificial intelligence) could be subverted by adversaries, even without needing to anticipate specific methods of subversion.Formal verification has the potential to anticipate the futures of complex software systems and block their most problematic potential behaviors, through mathematical proof. It may be the most potent protection against misbehavior by superintelligent systems, whose eventual plans we would not be able to foresee. However, we are only protected if we manage to write out formal specifications that actually block the bad behaviors. I previously wrote about two main techniques to simplify the job of writing those specifications, namely end-to-end formal verification, to catch mistakes in the connections between components; and careful encapsulation of most components away from the complex human and natural world. Now I would like to write about an underappreciated benefit of formal methods for security, already relevant to conventional systems but perhaps even more important for artificial intelligence. This concept has been known in the formal-methods community (e.g. see discussion around the seL4 verified operating system) but still remains not widely enough understood.The Cybersecurity Arms RaceCybersecurity is often described as fundamentally favoring attackers over defenders. The reason is that an attacker often only needs to find one vulnerability in a system to subvert it, while a defender (or author of a system) needs to anticipate all possible attacks and build protections against all. Not noticing just one potential attack vector, or simply delaying too long in patching a known hole, can allow an attacker to just as effectively do damage