Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

Created by
  • Haebom

Author

David Chanin, Tomáš Dulka, Adrià Garriga-Alonso

Outline

Sparse autoencoders (SAEs) are assumed to decompose model activations into interpretable linear directions, but this holds only when activations are composed of sparse linear combinations of underlying features. In this paper, we find that when an SAE is narrower than the number of underlying "true features" on which it is trained and those features are correlated, the SAE merges components of correlated features into its latents, destroying monosemanticity. We call this phenomenon "feature hedging," and both conditions are almost certainly present in SAEs trained on LLMs. Feature hedging is caused by the SAE's reconstruction loss and becomes more severe the narrower the SAE is. We introduce the feature hedging problem and study it theoretically in toy models and empirically in SAEs trained on LLMs. We hypothesize that feature hedging may be a key reason why SAEs consistently underperform supervised baselines. Finally, based on our understanding of feature hedging, we propose an improved variant of the matryoshka SAE. SAE width is not a neutral hyperparameter: narrow SAEs are more affected by feature hedging than wide SAEs.
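To make the mechanism concrete, here is a minimal toy sketch (my own PyTorch illustration, not code from the paper): data is generated from two correlated features, and an SAE with only one latent is trained with reconstruction MSE plus an L1 sparsity penalty. The feature directions, firing probabilities, and all hyperparameters are assumptions chosen for illustration only.

```python
# Toy sketch of feature hedging (illustrative assumptions, not the authors' setup):
# two correlated "true features" but an SAE with only one latent. Because of the
# reconstruction loss, the single decoder direction absorbs part of the correlated
# second feature instead of representing feature 1 alone, breaking monosemanticity.
import torch

torch.manual_seed(0)
d_model, n_samples = 16, 20_000

# Two orthogonal true feature directions (here: standard basis vectors).
d1 = torch.zeros(d_model); d1[0] = 1.0
d2 = torch.zeros(d_model); d2[1] = 1.0

# Sparse, correlated firing: feature 2 only fires when feature 1 fires.
a1 = (torch.rand(n_samples) < 0.2).float()
a2 = a1 * (torch.rand(n_samples) < 0.5).float()
x = a1[:, None] * d1 + a2[:, None] * d2

# A 1-latent SAE: narrower than the 2 true features in the data.
# Nonnegative encoder init so the latent is not dead at the start.
W_enc = torch.rand(d_model, 1, requires_grad=True)
W_dec = torch.randn(1, d_model, requires_grad=True)
b_enc = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec, b_enc], lr=1e-2)

for _ in range(2000):
    z = torch.relu(x @ W_enc + b_enc)          # encode
    x_hat = z @ W_dec                          # decode
    loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# A monosemantic latent for feature 1 would align its decoder with d1 only.
# Hedging shows up as a substantial decoder component along d2 as well.
dec = W_dec.detach()[0]
dec = dec / dec.norm()
print("decoder cosine with d1:", round(dec[0].item(), 3))
print("decoder cosine with d2:", round(dec[1].item(), 3))
```

Running this prints a clearly nonzero cosine with d2: with only one latent available, the reconstruction loss rewards hedging toward the correlated feature rather than a clean, single-feature direction.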

Takeaways, Limitations

Takeaways:
  • We show that SAE width is an important hyperparameter: narrower SAEs are more susceptible to feature hedging.
  • We provide evidence that feature hedging occurs in LLM SAEs and identify it as a plausible cause of their underperformance relative to supervised baselines.
  • Based on this understanding of feature hedging, we propose an improved variant of the matryoshka SAE.
Limitations:
  • Feature hedging has been studied in toy models and targeted experiments; further research is needed to determine its impact, and the effectiveness of the proposed remedy, in real LLM settings.
  • The claim that feature hedging is a main reason SAEs underperform remains a hypothesis; further analysis is needed to pin down its relationship to other contributing factors.
  • The proposed improved matryoshka SAE variant still requires extensive validation of its performance.