Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model
Executive Summary

Problem Statement of the Project

Sparse autoencoders (SAEs) and k-sparse autoencoders have been used effectively to extract meaningful, interpretable features from neural networks, including Large Language Models (LLMs). However, their effectiveness on the new small reasoning models remains unclear. While it may seem obvious that SAEs can extract features from reasoning models, it has not been established whether they can effectively uncover features, especially in small reasoning models. In addition, an interesting question arises from the nature of reasoning models: do reasoning models have specific features that relate to the thinking process they perform?

This project investigates whether k-sparse autoencoders can extract features from a small reasoning model, with a focus on finding evidence for features that correspond to the model’s reasoning process. Evidence on this question could support the presence of interpretable features related to the model’s reasoning process, and could possibly improve our understanding of small reasoning models.

Principal Takeaways

The trained k-sparse autoencoders effectively extracted features from DeepSeek R1 Distill Qwen 1.5B.

The analysis of the latent activations obtained from the SAE encoder showed several features, such as features 32456, 6252 and 31146 in layer 10, that activate strongly on tokens related to the model’s reasoning process.

Some features have top activations on several different tokens and seem complicated to interpret, possibly because there is no direct relationship between those tokens.

Other features have top activations on only a few different tokens, suggesting that they have a specialized focus or are selective, responding to specific tokens.

While the results are promising, I have to say
that t