Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

Created by
  • Haebom

Author

David Chanin, Adrià Garriga-Alonso

Outline

Sparse autoencoders (SAEs) extract features corresponding to interpretable concepts from activations inside an LLM. A key SAE training hyperparameter is L0, the average number of SAE features that are active per token. Previous work compares SAE algorithms on sparsity-reconstruction tradeoff plots, implying that L0 is a free parameter with no single correct value beyond its effect on reconstruction. This study investigates how L0 affects SAEs and shows that if L0 is not set correctly, the SAE fails to separate the LLM's underlying features: if L0 is too low, the SAE mixes correlated features to improve reconstruction; if L0 is too high, it finds a corrupted solution that also mixes features. The authors also present a proxy metric that helps find the correct L0 for an SAE on a given training distribution. Their method recovers the correct L0 in toy models, and in LLM SAEs the selected L0 matches the best sparse-probing performance. They find that most commonly used SAEs have an L0 that is too low. Overall, the study shows that L0 must be set correctly to train SAEs with the correct features.
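For context, L0 here is the average number of SAE features that fire on each token. Below is a minimal sketch of how this quantity can be measured for a trained SAE; the sae_encode callable, tensor shapes, and batching are illustrative assumptions, not the authors' code.

```python
import torch

def measure_l0(sae_encode, activations, batch_size=4096):
    """Estimate the empirical L0 of an SAE: the average number of
    non-zero latent features per token over a sample of activations.

    sae_encode  : callable mapping LLM activations -> SAE feature activations
                  (hypothetical interface)
    activations : tensor of shape (num_tokens, d_model)
    """
    total_active = 0
    total_tokens = 0
    with torch.no_grad():
        for start in range(0, activations.shape[0], batch_size):
            batch = activations[start:start + batch_size]
            feats = sae_encode(batch)                 # (batch, num_features)
            total_active += (feats != 0).sum().item() # count active features
            total_tokens += batch.shape[0]
    return total_active / total_tokens
```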

Takeaways, Limitations

Importance of the L0 hyperparameter: setting L0 appropriately is shown to be essential for SAE quality.
Consequences of a wrong L0: if L0 is too low or too high, features are mixed, degrading interpretability.
Proxy metric: a proxy metric is proposed to help find the right L0 (an illustrative sparse-probing sketch follows this list).
Experimental finding: the L0 of most commonly used SAEs is set too low.
Limitations:
No concrete methodology is given for automating the L0 optimization process.
Generalizability to other models and datasets may be limited.
Limited in-depth explanation of the theoretical basis and intuition behind the L0 setting.
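The paper's proxy metric itself is not reproduced here. As a hedged illustration of the sparse-probing evaluation the outline refers to, the sketch below scores an SAE by how well a probe restricted to a few of its features predicts a labeled binary concept on held-out tokens; the function name, the top-k selection by mean activation difference, and the train/test split are all assumptions for illustration, not the authors' procedure. In a hypothetical sweep, one would train SAEs at several candidate L0 values and prefer the one with the highest score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sparse_probe_accuracy(features, labels, k=5, seed=0):
    """Score an SAE by training a probe on only its top-k features
    (ranked by mean activation difference between classes) and
    measuring held-out accuracy on a binary concept.

    features : array of shape (num_tokens, num_sae_features)
    labels   : binary array of shape (num_tokens,)
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=seed
    )
    # Rank features by how differently they activate on the two classes.
    mean_diff = np.abs(x_tr[y_tr == 1].mean(0) - x_tr[y_tr == 0].mean(0))
    top_k = np.argsort(-mean_diff)[:k]
    probe = LogisticRegression(max_iter=1000).fit(x_tr[:, top_k], y_tr)
    return probe.score(x_te[:, top_k], y_te)
```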