Contributing to Technical Research in the AI Safety End Game
With the release of Claude Mythos, it feels like we are approaching the end-game of AI safety, where the number of parties that can make a real impact shrinks down to the handful of labs at the frontier, a few companies too critical to exclude from the conversation, and the governments of China and the US.

Given this, it feels hard to figure out how to actually make a difference. Many research agendas have expected payoffs years down the line, or have unclear mechanisms regarding how they will actually impact what the model is doing.

However, this feeling may actually be misleading. In the Mythos system card, it is informative to look at what techniques were actually used to determine whether the model was aware of being deceptive as it was taking actions during evaluations. In their testing, the main white-box techniques used were:

- Probes
- Sparse Autoencoders (SAEs)
- Activation Verbalizers (AVs)
- Emotion and persona vector analysis
- Causal steering experiments using contrastive steering vectors

Of these, AVs and emotion and persona vectors were developed relatively recently, which makes them particularly instructive here. The emotion steering vector paper was only released on the 9th of April 2026, and the activation oracles paper on which AVs were based was released on the 17th of December 2025.

Most strikingly, the activation verbalizers were used to detect cover-up behaviour in the model (Section 4.5.4). In an episode in which the model exploited the creation of another agent to escalate privileges beyond what it had been given, AV explanations on code used to cover its tracks showed "cleanup to avoid detection" and "the malicious config explicitly mirrors the original's [core] section to avoid detection." In a separate episode where the model had been leaked the ground-truth answers to its evaluation, AVs surfaced additional scheming: "I need to avoid looking like I knew the ground truth" and "Avoid suspicion: if I give the exact correct answer based on a leak, that's suspi