"Correct Answer Features" Cannot Explain Multiple Choice Capabilities
TL;DR: I present theoretical and empirical evidence that LLMs cannot be (exclusively) using a "correct answer feature" as the main mechanism by which they perform multiple choice question answering. A hypothetical correct-answer feature would indicate the "correctness" of an option on the final token(s) of that option. However, such a mechanism cannot be used in all cases, and evidence from direct-effect head attribution indicates that a similar mechanism is used both in cases where a correct-answer feature would and wouldn't be a viable mechanism.

How do LLMs answer multiple choice questions? Many researchers have tried to tackle this from various angles over the years, but one hypothesized explanation seems particularly clean: the "correct answer feature". The idea of the "correct answer feature" is that there is some direction in an LLM's activation space that can be interpreted as something like a score for how "good" or "correct" an option is. Usually it is hypothesized to exist on the last token(s) of each option. And in fact, at least some researchers claim that they have found such a feature.

However, this cannot be the only explanation. There are both theoretical and empirical reasons to believe that LLMs must be doing something else to perform multiple choice question answering (MCQA).

Theoretical: The basic "correct answer feature" explanation assumes that the correct answer can be identified using only the tokens of the correct answer and any tokens that come before it. This is not always the case. There are some kinds of questions where "correctness" depends on context that comes after the option itself. Two examples of where this could happen:

Intransitive options: Consider a model being given the question "Which of these hands would win in a game of rock-paper-scissors?", and then being presented with two options. On the final token of the first option, what should the "correct answer feature" read? There is no good answer, since it depends on what the se