Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Created by
  • Haebom

Authors

David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, Joseph Bloom

Outline

This paper studies Sparse Autoencoders (SAEs), which aim to decompose the activation space of a large language model (LLM) into human-interpretable latent directions, or features. Increasing the number of features in an SAE leads to feature splitting, in which a general feature is split into finer-grained ones (e.g., “mathematics” splits into “algebra”, “geometry”, etc.). The paper shows, however, that sparse decomposition and splitting of hierarchical features is not robust. In particular, a feature that appears to have a single meaning can fail to activate on inputs where it should, because it has been “absorbed” into its more specific child features; the paper calls this feature absorption. It arises from optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. The authors present a metric for detecting absorption in SAEs and validate it experimentally on hundreds of LLM SAEs. Their results suggest that simply changing the size or sparsity of an SAE is not enough to solve the problem. Finally, they discuss the fundamental theoretical issues that must be addressed before SAEs can be used to interpret LLMs robustly and at scale, along with potential solutions.
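To make the sparsity incentive behind absorption concrete, here is a minimal toy sketch in Python. It is not taken from the paper: the three-dimensional space, the two feature directions, the hand-written activations, and the names `dec_a`, `dec_b`, `l0`, and `l1` are all assumptions chosen for illustration. It compares a dictionary whose latents match the true parent and child features against one where the child latent has absorbed the parent direction.

```python
import numpy as np

# Hypothetical ground-truth feature directions (orthogonal, unit norm).
parent = np.array([1.0, 0.0, 0.0])  # a general feature
child = np.array([0.0, 1.0, 0.0])   # a specific feature that implies the parent

# Hierarchy: the child never occurs without the parent.
inputs = np.stack([
    parent,          # input with the parent feature only
    parent + child,  # input with both parent and child features
])

# Dictionary A (no absorption): one latent per true feature.
dec_a = np.stack([parent, child])
acts_a = np.array([
    [1.0, 0.0],  # parent latent fires
    [1.0, 1.0],  # parent and child latents both fire
])

# Dictionary B (absorption): the child latent's decoder direction also
# contains the parent direction, so the parent latent can stay silent.
sqrt2 = np.sqrt(2.0)
dec_b = np.stack([parent, (parent + child) / sqrt2])  # rows kept at unit norm
acts_b = np.array([
    [1.0, 0.0],    # parent latent fires
    [0.0, sqrt2],  # only the child latent fires; the parent is "absorbed"
])

def l0(acts):
    """Total number of active latents across all inputs."""
    return int(np.count_nonzero(acts))

def l1(acts):
    """Sum of absolute activations across all inputs."""
    return float(np.abs(acts).sum())

# Both dictionaries reconstruct the inputs exactly.
recon_a = acts_a @ dec_a
recon_b = acts_b @ dec_b

print("A: L0 =", l0(acts_a), "L1 =", l1(acts_a))
print("B: L0 =", l0(acts_b), "L1 =", l1(acts_b))
```

In this toy setting both dictionaries reconstruct the inputs perfectly, but the absorbing dictionary uses fewer active latents and a smaller total activation, so a sparsity penalty would prefer it. The cost is that the parent latent no longer fires on every input where the parent feature is present.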

Takeaways, Limitations

Takeaways: The paper shows that sparse decomposition and splitting of hierarchical features in SAEs is not robust, and it introduces the phenomenon of feature absorption. This identifies an important limitation of applying SAEs to LLM analysis. The paper also proposes a new metric for detecting feature absorption.
Limitations: The paper shows that changing the size or sparsity of an SAE alone cannot solve the feature absorption problem, but it does not offer a concrete solution to the underlying problem. Detecting feature absorption with the proposed metric requires further research. A more robust and scalable methodology for LLM interpretation is still needed.
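
This summary does not describe how the paper's absorption metric is computed, so the following Python sketch is only a simplified heuristic for what such a check might look like, not the authors' method. The function name `flag_absorption`, the use of a probe direction for the parent feature, and the cosine threshold of 0.5 are all assumptions. The idea: flag an input when the parent feature is known to be present, the latent that normally tracks it is silent, and some other active latent has a decoder direction aligned with the parent feature.

```python
import numpy as np

def flag_absorption(acts, decoder, probe_dir, parent_idx, has_parent,
                    cos_threshold=0.5):
    """Toy heuristic: mark inputs where absorption is suspected.

    acts:        (n_inputs, n_latents) latent activations
    decoder:     (n_latents, d_model) decoder directions, one per row
    probe_dir:   (d_model,) assumed direction of the parent feature
    parent_idx:  index of the latent that normally tracks the parent
    has_parent:  (n_inputs,) booleans, True where the parent feature is present
    cos_threshold: arbitrary alignment cutoff (an assumption, not from the paper)
    """
    probe = probe_dir / np.linalg.norm(probe_dir)
    dec_unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = dec_unit @ probe                 # alignment of each latent with the probe
    aligned = cos > cos_threshold
    aligned[parent_idx] = False            # ignore the parent latent itself
    parent_silent = acts[:, parent_idx] == 0
    other_aligned_fires = (acts[:, aligned] > 0).any(axis=1)
    return has_parent & parent_silent & other_aligned_fires

# Hypothetical example: latent 1's decoder direction partly contains the
# parent direction, and it fires alone on the second input.
decoder = np.array([[1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0]]) / np.array([[1.0], [np.sqrt(2.0)]])
acts = np.array([[1.0, 0.0],
                 [0.0, np.sqrt(2.0)]])
flags = flag_absorption(
    acts, decoder,
    probe_dir=np.array([1.0, 0.0, 0.0]),
    parent_idx=0,
    has_parent=np.array([True, True]),
)
print(flags)
```

A real metric would need a principled way to obtain the probe direction and the ground-truth labels, and to choose the threshold; this sketch takes all three as given.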