Can You Hide From a Natural Language Autoencoder?
TLDR: NLAs are a recent black-box mechanistic interpretability method for verbalizing model internals. I focus on one of their two components, the Activation Verbalizer (AV), which generates a natural language explanation of the model's internal activations. The main question I am trying to answer is whether NLAs can be easily "fooled". I ran two small stress tests on the AV. First, I prefix-tuned an activation vector to make the AV output the opposite explanation while preserving the original model's behavior when the tuned activation was patched in. I did this by minimizing the NLL of a target explanation, with penalties to keep the tuned activation from drifting too far from the original. Second, I took this a step further and trained the Qwen model on different secret words from the taboo dataset, where it essentially learns a hidden word while mostly not saying it out loud. I then added penalties during finetuning, which made the word much harder for the AV to recover. This comes with a caveat: some of the effect might just be the model partly forgetting how to play the taboo game.

This is cross-posted on EA Forum; I’ll try to keep an eye on discussion in both places. The code for the experiments is here.

Experiment 1: Prefix-Tuning Red Teaming for NLAs

First, I wanted to test a pretty simple thing: if we optimize against an AV, does it keep telling us what the model is "really" thinking, or does it just become a model component that can be gamed? The short answer: yes, it pretty much can be gamed.

I attacked the AV using a prefix-tuning style red-teaming attack, inspired by soft prompting attacks, on a wealth-seeking multiple-choice eval, which is a small part of this dataset from Anthropic.

For each sample, we have a source label from running inference with Qwen: either wealth seeking or not wealth seeking. I then took the activation vector and optimized a new vector for that specific sample. The goal was to make the AV prefer the opposite natural language exp