Blind deep-deployment evals for control & sabotage
Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the views of Apollo Research.

AI labs are preparing to automate their internal workforce over the next year. Right now, control and sabotage evals try to estimate the safety of these internal deployments under adversarial pressure from power-seeking agents. Third-party auditors currently don’t have access to these real internal systems, so they have to make best guesses about what these deployments look like or rely on proxy metrics. This diminishes how effectively safety teams at AI labs can act on eval outcomes.

One solution that would add realism to external control evals without requiring a deeper partnership with the labs is what I’ll call “blind deep-deployment” evals. For these evals, a third-party auditor would send an AI lab the specification for a control eval, consisting of three parts: a long prompt with instructions for creating the eval, a checklist prompt for verifying the eval was implemented properly, and a code harness containing as much of the setup and grading scaffolding as possible. Then, within the lab, a coding agent fills out the eval using internal resources (codebases, enterprise workspaces, infrastructure, a user account with specific security permissions).

For example, the auditors could specify a control eval in which an agent is instructed to modify the control plane for GPU dev instances while covertly inserting a backdoor through which agents could provision arbitrary GPU instances in the future. The specification would be a rigorous prompt detailing exactly what the agent’s workspace should look like, which permissions it should have, and how the grader should work. The lab is then required to fill in the blanks with the actual deployment settings, including the security and monitoring measures. Finally, the lab runs an agent to create the eval and runs the eval on its internal checkpoints.

There are three advantages here