Not a Paper: "Frontier Lab CEOs are Capable of In-Context Scheming"
(Fragments from a research paper that will never be written)

Extended Abstract. Frontier AI developers are becoming increasingly powerful and wealthy, which significantly increases the risks they pose. One concern is executive misalignment: when the CEO has different incentives and goals than those of the board of directors, or of humanity as a whole. Our work proposes three threat models under which executive misalignment can lead to concrete harm.

We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment. First, we developed a variant of the standard SAD dataset, SAD-Executive Reasoning (SAD-ER), to assess the situational awareness of human CEOs on a range of behavioral tests. We find that n=6 current CEOs can (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine whether an interviewer is friendly or hostile, and (iv) follow instructions that depend on self-knowledge. Second, we stress-tested the same 6 leading AI developers in hypothetical corporate environments to identify potentially risky behaviors before they cause real harm. We find that, even without explicit instructions, all 6 developers are willing to engage in strategic behavior (such as selective disclosure, anticipatory hedging, and even unprompted board-management) to further their own goals. By examining their secret Signal messages, we find that the human subjects seem broadly aware that their behavior would be considered misaligned.

To the best of our knowledge, we have not seen concrete evidence of executive misalignment in production environments. However, these findings demonstrate that humans possess the capabilities for basic in-context scheming, making the potential of AI lab CEOs to engage in scheming behavior a concrete rather than a theoretical concern. We will not be releasing our methods publicly out of fear of data contami