Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Created by
  • Haebom

Author

Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Outline

This study explores using pretrained sparse autoencoder (SAE) features to control the output language of multilingual large language models (LLMs). Specifically, SAEs applied to the residual streams of the Gemma-2B and Gemma-9B models were used, in a zero-shot setting without explicit language prompts or fine-tuning, to identify features whose activations differ across English, Chinese, Japanese, Spanish, and French. By manipulating a single SAE feature, the authors achieved language switching with up to a 90% success rate (measured by FastText language classification) while preserving semantic fidelity (measured by LaBSE similarity). Their analysis shows that language steering is most effective in the mid-to-late transformer layers and is amplified by specific attention heads associated with language-sensitive SAE features.
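The core intervention can be sketched as adding a scaled SAE decoder direction to the residual stream at a chosen layer. The sketch below is a minimal illustration with assumed shapes and names (`steer_residual`, `alpha`, the toy dimensions); it is not the authors' implementation, which operates inside a real Gemma forward pass.

```python
import numpy as np

def steer_residual(resid, decoder_dir, alpha):
    """Add a scaled SAE feature direction to every position's residual vector.

    resid:       (seq_len, d_model) residual-stream activations at one layer
    decoder_dir: (d_model,) SAE decoder column for the chosen language feature
    alpha:       steering strength (hypothetical knob; positive values push
                 generation toward the target language)
    """
    # Normalize the direction so alpha directly controls the added magnitude.
    direction = decoder_dir / np.linalg.norm(decoder_dir)
    return resid + alpha * direction

# Toy example: 4 token positions, 8-dimensional residual stream.
rng = np.random.default_rng(0)
resid = rng.normal(size=(4, 8))
feature_dir = rng.normal(size=8)     # stands in for a language-sensitive SAE feature

steered = steer_residual(resid, feature_dir, alpha=5.0)
```

In an actual model this addition would be applied via a forward hook at a mid-to-late layer, where the paper reports steering is most effective.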

Takeaways, Limitations

  • Demonstrates that multilingual generation can be controlled in a lightweight, interpretable way through sparse feature steering.
  • Achieves a high language-switching success rate in a zero-shot setting, without explicit prompts or fine-tuning.
  • Improves understanding of model behavior by revealing the association between specific attention heads and language-sensitive SAE features.
  • Experiments were limited to the Gemma-2B and Gemma-9B models; further studies are needed to establish generalizability to other models and languages.
  • Evaluation relied on FastText for language classification and LaBSE similarity for semantic fidelity; analysis with additional metrics is needed.
  • Only single-feature manipulation was studied; the effects of manipulating multiple features simultaneously remain to be investigated.