Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Created by
  • Haebom

Author

Xudong Zhu, Mohammad Mahdi Khalili, Zhihui Zhu

Outline

This paper studies sparse autoencoders (SAEs) for interpreting large language models (LLMs). Existing SAEs enforce nonnegativity on their activations, which limits their ability to represent bidirectional concepts (e.g., a single feature capturing both poles of a contrast). The proposed variant, AbsTopK, instead selects the top-k activations by absolute value, preserving their signs and enabling richer bidirectional concept representations. Experiments on various LLMs and tasks demonstrate the advantages of AbsTopK.
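The core difference between the two selection rules can be sketched in a few lines. This is a minimal illustration of the idea as described above, not the authors' implementation; the function names and the toy vector are hypothetical.

```python
import numpy as np

def topk_nonneg(z, k):
    """Standard TopK-style SAE activation: keep the k largest entries,
    clamped to be nonnegative (the constraint the summary describes)."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]            # indices of the k largest values
    out[idx] = np.maximum(z[idx], 0.0)  # nonnegativity discards negative evidence
    return out

def abstopk(z, k):
    """AbsTopK-style activation: keep the k entries with the largest
    absolute value, preserving their signs, so one feature can fire
    in either direction (bidirectional concepts)."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]    # rank by magnitude, not signed value
    out[idx] = z[idx]                   # signs are kept
    return out

z = np.array([2.0, -3.5, 0.1, 1.0, -0.2])
print(topk_nonneg(z, 2))  # [ 2.   0.   0.   1.   0. ] -- the strong negative activation is lost
print(abstopk(z, 2))      # [ 2.  -3.5  0.   0.   0. ] -- -3.5 survives with its sign
```

On this toy vector, the nonnegative variant keeps the two largest positive entries and drops the strongest activation overall (-3.5), whereas AbsTopK retains it, which is exactly the bidirectional behavior the paper argues for.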

Takeaways, Limitations

Takeaways:
  • Identifies a fundamental limitation of existing SAEs (the nonnegativity constraint makes bidirectional concepts hard to express) and proposes AbsTopK, a new SAE variant that addresses it.
  • AbsTopK improves LLM interpretability, allowing a single feature to encode contrastive concepts.
  • Demonstrates the advantages of AbsTopK through extensive experiments on various LLMs and tasks.
  • Performs on par with or better than the Difference-in-Means method, a supervised baseline.
Limitations:
  • No limitations are explicitly discussed in the paper.