Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper remains with its authors and their institutions; when sharing, please cite the source.

Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Created by
  • Haebom

Author

Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu

Outline

This paper studies sparse autoencoders (SAEs), which have recently emerged as a powerful tool for interpreting and steering the internal representations of large language models (LLMs). Existing SAE analysis methods tend to rely solely on input-side activations, without considering the causal influence each latent feature has on the model output. The study rests on two main hypotheses: (1) activated latent features do not contribute equally to the model output, and (2) only latent features with high causal influence are effective for model steering. To test these hypotheses, the paper proposes the Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that incorporates output-side gradient information to identify the most influential latent features.
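The core idea can be illustrated with a toy sketch: score each SAE latent not by its activation alone, but by combining the activation with the output-side gradient flowing back through it. This is a minimal illustration of the idea, not the authors' implementation; the SAE weights, the scalar "output", and the activation-times-gradient score below are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d_model, d_latent = 16, 64

# Toy SAE encoder/decoder weights (assumed already trained).
W_enc = torch.randn(d_model, d_latent)
W_dec = torch.randn(d_latent, d_model)

h = torch.randn(d_model)  # a hidden state taken from the LLM

# Input-side view: latent activations alone.
z = torch.relu(h @ W_enc).detach().requires_grad_(True)

# Stand-in for the model output: a scalar computed downstream of the
# SAE reconstruction (in practice, a logit or loss of the LLM).
recon = z @ W_dec
out = recon.sum()
out.backward()

# Output-side view: weight each activation by its gradient, so latents
# that the output is insensitive to are scored low even when active.
influence = (z * z.grad).abs()
top_latents = influence.topk(8).indices  # keep the most influential latents
print(top_latents)
```

Under this scoring, a strongly activated latent whose gradient is near zero is discarded, which is exactly the distinction the two hypotheses above draw between "activated" and "influential" features.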

Takeaways, Limitations

Takeaways: The paper presents GradSAE, a novel method that leverages output-side gradient information to effectively identify which latent features in an LLM exert the greatest influence on the model output. This can improve the accuracy and efficiency of interpreting and steering LLM internal representations. It demonstrates that not all activated latent features are equally important, and that only those with high causal influence are effective for model steering.
Limitations: GradSAE's performance and generalization capabilities require broader experimental validation. Its applicability and limitations across different LLM architectures and tasks remain to be clearly characterized. Further research may be needed to establish the theoretical basis of the proposed hypotheses.