Critique of current AI safety bug bounty programs
The potential value of AI safety bug bounty programs

Generally, AI labs should (and most do) put their models through extensive safety testing before deployment to prevent misuse, scheming, and other dangerous behaviors. This may include internal tests, red-teaming by third parties, and so on. However, edge-case safety vulnerabilities will likely slip through, and these can still cause damage. If any of the risks from AI systems built by labs that implement strong safety measures in the first place (some don't) were to come to fruition, they would presumably stem from vulnerabilities that internal testing missed. So designing incentives for finding safety vulnerabilities post-deployment ought to be a high priority.

Many people already try to find these vulnerabilities and share them on Twitter for their own entertainment. Bug bounty-like programs, where labs offer financial rewards for disclosing safety vulnerabilities, are well suited to incentivizing this further. They have the added benefit of likely surfacing vulnerabilities from more realistic use cases, since, depending on the risk type, users may discover safety vulnerabilities unintentionally. For instance, a user may find that their agent has fallen victim to a novel prompt injection attack and identify the attack used.

AI labs could then use this information to inform future safety measures, so that vulnerabilities submitted through the program are not reproducible in future models. It could also inform a decision to temporarily halt deployment of a model if the risk is particularly egregious, such as reproducible uplift in developing powerful CBRN weapons. Similarly, Anthropic has already delayed the public release of Claude Mythos in response to the results of its internal testing of the model's cyberattack capabilities. Many people have already brought attention to this.

Fortunately, Anthropic, OpenAI, and Google already offer programs like this.
But unfortunately, I see t