Research agenda: Interpretive debate
One-sentence pitch: our goal is to develop a piece of epistemic infrastructure for iteratively and empirically answering interpretive questions about AI models, where the accumulation of empirics leads to resolution of interpretive ambiguity and/or calibration of uncertainty.

This directly builds on our “performative misalignment” work (paper, LW post), which we see as a minimal demonstration of one round of debate.

The difficulty and importance of interpretive questions

AI safety involves a lot of fuzzy, interpretive questions:
Is this model scheming?
Is the model sandbagging?
Is the model lying?
Is the model introspective?
Is the model preferring itself because it recognizes itself?
Does this steering vector actually control emotion?

Why do we need interpretation? Because even though CoTs are generally monitorable right now, we can’t trust that the model is verbalizing all the reasons that lead to its decisions, or that our behavioral testing fully reveals its latent knowledge and capabilities.

These questions are hard, partly because concepts like scheming and introspection are very hard to define accurately. For weaker models, the interpretive questions of interest were more like: Is the model exploiting the multiple-choice format when answering questions? Is the model exploiting lexical heuristics when predicting entailment? Is the model looking only at the background when recognizing objects? Those are much easier to define.

Despite the difficulty, these questions are important, because the answers underpin our predictions about how the model generalizes. Even for questions that fundamentally cannot be resolved at the moment, we want calibrated uncertainty, e.g., P(scheming), both about the fundamental ambiguity of the question itself and about the efficacy of available tools.

Recently, there have been calls to approach these interpretive questions in a more scientific manner, such as a science of scheming and a science of motivations, to name a few. I think of them as key building blocks