Not all features are created equal
TL;DR: Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters do not account for this variety. In this article, I propose AIR (Auto-Interpretability Router), a new protocol that uses a sentence embedder to identify a feature's category and, based on that category, routes the most appropriate activation examples to the auto-interpreter. Results show that, compared with state-of-the-art auto-interpreters from OpenAI and Neuronpedia, AIR produces more accurate feature explanations at a lower cost.

Introduction

Recent studies by Anthropic on attribution graphs (Lindsey et al., 2025; Ameisen et al., 2025) highlighted different types of features, which they split into three categories: input, abstract, and output features. In their words:

In most prompts, paths through the graph begin with “input features” representing tokens or other low-level properties of the input and end with “output features” which are best understood in terms of the output tokens that they promote or suppress. Typically, more “abstract features” representing higher-level concepts or computations reside in the middle of graphs. (Lindsey et al., 2025)

Features are the atoms of LLM mechanistic interpretability. The first step is usually to isolate these atoms with techniques such as SAEs or Transcoders. The second step is to understand what they mean. The standard way to explain a feature is with an auto-interpreter LLM (Bills et al., 2023): ask an LLM to find a common thread among the top-activation examples of that feature.

The original oai_token-act-pair, from OpenAI, and the more recent np_max-act-logits, from Neuronpedia, are among the state-of-the-art protocols for auto-interpretability. In the former, the auto-interpreter is fed token-activation pairs taken from the top-activation examples. In the latter, more information is given to the a