NLA Thought Anchors
The following post seeks to look further into why NLA (Natural Language Autoencoders) contains the prediction more often when the original activations led to the correct output than incorrect output.Quick Summary:Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final answer First sentence is the most counterfactually important for both activation reconstruction loss and the AV containing the final output Sentences counterfactually important for generating the final answer correlate with lower reconstruction loss, suggesting the AR training reward encourages the model to include correct answersDegenerate NLA outputs (repetition, garbled tokens, emoji blocks) appear only for activations from incorrect model responses.NLA response length varies more for incorrect activations, possibly reflecting model uncertaintyIncorrect activations reconstruct ~30% worse than correct onesKey Findings:Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final answerSurprisingly when looking at activations that led to the incorrect answer the NLA sometimes had outputs that led to broken or degenerate responses examples includes repetition, garbled tokens, emoji blocks, etc. This only appears in NLA for activations that led to incorrect responses along with the fact NLA response length varies more for incorrect activations, possibly reflecting model uncertainty.The final answer contributes more to the NLA's reconstruction loss when the activations led to the correct output, and less when they did not.NLA seems to have higher reconstruction loss when the activations lead to the wrong answer on the GSM8K datasetThe first sentence seems to be the most counterfactually important for NLA AV responses both for reconstruction loss and the response containing the final answer (contain actual answer vs model response). The counterfactual importance was more evenly spread across sentences for bas