Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders

Created by
  • Haebom

Author

Ananya Joshi, Celia Cintas, Skyler Speakman

Outline

This paper presents a novel method for aligning generated output to arbitrary topics by applying Sparse Autoencoders (SAEs) to the layers of a large language model (LLM). Building on prior work showing that SAE neurons correspond to interpretable concepts, the method 1) scores each SAE neuron by its semantic similarity to the alignment target text, and 2) modifies the output at the SAE layer by emphasizing topic-relevant neurons. Experiments use public topic datasets such as Amazon reviews, medicine, and sycophancy, along with open-source LLM and SAE combinations such as GPT-2 and Gemma. On medical prompts, the alignment experiments show advantages over fine-tuning, including improved average language acceptability (0.25 vs. 0.5), reduced training time across topics (333.6 seconds vs. 62 seconds), and inference overhead acceptable for many applications (+0.00092 seconds/token). The source code is available at github.com/IBM/sae-steering.
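The two steps described above (scoring SAE neurons against a target text, then amplifying the most relevant ones before decoding) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the encoder/decoder weights, per-neuron concept embeddings, and target embedding are all random stand-ins, and the top-k gain steering rule is an assumption about how "emphasizing" relevant neurons might work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real components: an SAE trained on an LLM layer,
# per-neuron concept embeddings, and an embedding of the alignment target text.
d_model, n_neurons = 64, 512
W_enc = rng.normal(size=(d_model, n_neurons))            # SAE encoder weights
W_dec = rng.normal(size=(n_neurons, d_model))            # SAE decoder weights
neuron_desc_emb = rng.normal(size=(n_neurons, d_model))  # one concept vector per neuron
target_emb = rng.normal(size=d_model)                    # target-topic text embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Step 1: score each SAE neuron by semantic similarity to the target text.
scores = np.array([cosine(e, target_emb) for e in neuron_desc_emb])

def steer(hidden, scores, k=16, gain=4.0):
    """Step 2: amplify the top-k topic-relevant neurons in the SAE latent."""
    acts = np.maximum(hidden @ W_enc, 0.0)  # ReLU SAE activations
    top = np.argsort(scores)[-k:]           # indices of most topic-aligned neurons
    acts[top] *= gain                       # emphasize those neurons
    return acts @ W_dec                     # decode back to the residual stream

hidden = rng.normal(size=d_model)           # a hypothetical LLM hidden state
steered = steer(hidden, scores)
```

Because steering only rescales a handful of latent activations at one layer, it avoids any gradient updates, which is consistent with the reported training-time and inference-overhead advantages over fine-tuning.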

Takeaways, Limitations

Takeaways:
Presents a novel method for efficiently aligning LLM output to arbitrary topics.
Compared to fine-tuning, training time is shorter and average language acceptability is higher.
Demonstrates applicability across various LLM and SAE combinations.
Open-source code release improves accessibility.
Limitations:
Further validation of the generalization performance of the proposed method is needed.
Lack of comprehensive experimental results for various LLM and SAE combinations.
Possibility of bias towards certain topics.
Lack of performance evaluation in fields other than medicine.