Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate
Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us!

TL;DR

We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviewing many researchers (Figure 1 below), and our own analysis. We find that many of the properties relied on for current oversight face a range of likely and potentially severe degradation pathways. Much oversight rests on foundations that are likely to erode, absent effective intervention. We give specific recommendations for measuring shifts in oversight-relevant properties, working to preserve oversight, and investing in emerging oversight techniques as fallbacks against continued degradation.

The full report can be accessed at aisi.gov.uk/blog/will-it-become-harder-to-oversee-ai-systems.

Figure 1: A list of the experts interviewed to inform the content of this report. ∗Some experts preferred not to be named, and have not been included in this list.

My informal LessWrong blurb

This reflects only the personal view of Jordan Taylor, not the interviewed experts or UK AISI more broadly.

Right now, it seems we have pretty decent oversight. Not great, not terrible:

- When AIs deliberately do bad things, they can usually be caught.[1] e.g. by CoT monitoring, or by them admitting to it.
- When AIs are designed with hidden goals or propensities that they are introspectively aware of, they can usually be found e.g. by white-box techniques or prefill attacks.[2]

But there are a bunch of ways in which we might be playing on easy-mode today, relative to how difficult oversight like this could be in the future. The dominant problem is of course just generally increasing model capabilities for subversion, situational awareness, and doing tasks which are harder for humans to understand or oversee. But there are also a lot more acute (and potentially more preventable) risks to oversight. We compile thes