Trying to use NLAs to find out how Qwen 2.5 7B does multiplication
Neural language autoencoders were just introduced by Anthropic. In a fascinating paper, they showed that you can take the residual stream activations of a language model and then train two instantiations of that same model (an encoder and a decoder) to translate those activations into a natural language verbalisation of them and back. In theory, this is great because it literally lets us have activations explained to us, and we know that it's a faithful explanation because it can literally be translated back into the activations. Round-trip-validation!I was eager to try this out, and luckily, Anthropic provides the NLAs for some common open-source models like Qwen 2.5 7B (at Layer 20). Here is what I learned when trying to use this to extract Qwen’s multiplication algorithm, and why I think this method has potential, but a long way to go.As an easy first, I tried to make the AV (activation vocaliser a.k.a the encoder) explain how the model does multiplication. What I found initially delighted me:1. Qwen seems to reliably generate each digit of the result as a single token, making the likelihood of a clean, recognisable multiplication algorithm quite a bit higher.2. There is already a bit of a “hint of an 'algorithm”.For the problem “7 multiplied by 24”, the output looked like thisQwen gave the right answer of 168.For the forward pass that generated the token “1”, the verbalisation was this:Structured math format with a definition and example pattern, showing a multiplication problem using a number line to illustrate "9 × 12."The answer "The product of 9 and 13 is 1" strongly implies the result of the multiplication, completing the common result of the sequence, likely "156," a standard answer for the tower's total height.Final token "1" is mid-number in "1" — part of the answer expression "The result is 1," immediately expecting "56" or "36," completing the calculated product value, likely followed by "56" or "half-filled," continuing the visual description of the t