Should we combine protocols for AI Control Research?
In AI control[1] research, we've developed many protocols: trusted monitoring, untrusted monitoring, resampling, etc. Each has a different safety-usefulness tradeoff, and labs might use a combination of them. The reason: labs might want the maximum usefulness from their AI models at a minimum acceptable safety (e.g., 90%), and a single protocol may not deliver both the usefulness they want and the safety they need.

We typically model combining protocols as Defer to Trusted, where we first try the untrusted model and, if its solution is suspicious, defer to a trusted model. We found that we can get better safety-usefulness tradeoffs by routing between protocols based on each one's predicted usefulness and suspicion, and that this routing can be defended against an attacker who picks where to attack.

The harder problem is attack selection: a schemer that knows our routing process can choose to attack whichever protocol or monitor is weakest. We show how to allocate the audit budget and route between protocols to minimize the effect of attack selection.

Takeaways

We should allocate more of our human audit budget to protocols with lower predicted safety, in a way that an attacker cannot exploit. A monitor flags the most suspicious trajectory steps (the top percent, set by an audit threshold) and sends them to a human auditor, and our human audit budget is fixed. We can spend that budget efficiently by varying the audit threshold across protocols and monitors so that the blue team's predicted safety is the same regardless of which protocol or monitor is attacked. This audit budget allocation is a generalizable defense against attack selection.

Our best routing strategy balances the predicted usefulness and safety of trajectories, and can be improved by fine-tuning a predictor. Our best strategy was to route based on predicted usefulness minus the suspicion percentile on the honest distribution. We found that we could improve the routing by fine-tuning a model;