
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Created by
  • Haebom

Authors

Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

Outline

This paper studies a method for deterministically controlling the generation language of a large multilingual language model (LLM) in the zero-shot setting. We investigate whether the generation language of an LLM can be steered during inference by leveraging sparse autoencoder (SAE) features, which previous work has shown to correlate with interpretable model behavior. We use pre-trained SAEs on the residual streams of Gemma-2B and Gemma-9B to identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying only a single SAE feature in a single transformer layer, we achieve controlled language switching with a success rate of up to 90% according to FastText language classification, while maintaining semantic fidelity as measured by LaBSE similarity. Our analysis shows that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads that are disproportionately associated with language-sensitive SAE features. These results demonstrate the potential of sparse feature steering as a lightweight and interpretable mechanism for controlled multilingual generation.
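
Below is a minimal sketch of what single-feature steering on a residual stream could look like. It is not the authors' released code: the SAE architecture, the hook placement, the feature index, and the target activation value are all illustrative assumptions, and in practice the SAE weights would come from a pre-trained checkpoint rather than random initialization.

```python
# Minimal sketch of single-feature SAE steering on a residual stream.
# Dimensions, feature index, and target activation are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Simple SAE: h = ReLU((x - b_dec) @ W_enc + b_enc), x_hat = h @ W_dec + b_dec."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

def make_steering_hook(sae: SparseAutoencoder, feature_idx: int, target_value: float):
    """Forward hook that pushes one SAE feature toward a target activation by
    adding the corresponding decoder direction back into the residual stream."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        h = sae.encode(resid)
        # How far the chosen feature is from the desired activation, per token.
        delta = (target_value - h[..., feature_idx]).unsqueeze(-1)
        # Move the residual stream along that feature's decoder direction.
        steered = resid + delta * sae.W_dec[feature_idx]
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

if __name__ == "__main__":
    d_model, d_sae = 64, 512
    sae = SparseAutoencoder(d_model, d_sae)
    layer = nn.Linear(d_model, d_model)  # stand-in for a transformer block's residual output
    handle = layer.register_forward_hook(make_steering_hook(sae, feature_idx=123, target_value=8.0))
    x = torch.randn(2, 10, d_model)      # (batch, seq, d_model)
    print(layer(x).shape)                # steered output, torch.Size([2, 10, 64])
    handle.remove()
```

In the paper's setting, the hook would be attached to the residual stream of one mid-to-late layer of Gemma-2B or Gemma-9B, and `feature_idx` would be a language-sensitive SAE feature identified from activation differences between English and the target language.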

Takeaways, Limitations

Takeaways:
We show that the generation language of an LLM can be effectively controlled in zero-shot settings by manipulating sparse autoencoder features.
Modifying a single SAE feature achieves a high language-switching success rate (up to 90%).
Language switching is achieved while maintaining semantic fidelity (measured by LaBSE similarity).
We identify the transformer layers and attention heads that are most effective for language steering.
We present a lightweight and interpretable mechanism for controlled multilingual generation.
Limitations:
Results cover specific LLMs (Gemma-2B, Gemma-9B) and a limited set of languages (English, Chinese, Japanese, Spanish, French); generalizability to other LLMs or languages requires further study.
Further analysis is needed of the interpretability of the SAE features themselves.
Evaluation relies on external metrics such as FastText and LaBSE (see the sketch after this list); intrinsic evaluation methods should also be considered.
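
As a rough illustration of the kind of external evaluation mentioned above (not the paper's exact protocol), the sketch below checks language-switch success with the off-the-shelf FastText `lid.176.bin` language-identification model and semantic fidelity with LaBSE via sentence-transformers. The file path, example texts, and target language code are assumptions.

```python
# Hedged sketch: FastText language ID for switch success, LaBSE cosine
# similarity for semantic fidelity. Paths and examples are illustrative.
import fasttext                                               # pip install fasttext
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

lid = fasttext.load_model("lid.176.bin")  # pretrained FastText language-ID model (downloaded separately)
labse = SentenceTransformer("sentence-transformers/LaBSE")

def evaluate_pair(source_text: str, steered_text: str, target_lang: str):
    # Language-switch success: does FastText label the steered output with the target language?
    labels, _ = lid.predict(steered_text.replace("\n", " "))
    switched = labels[0].replace("__label__", "") == target_lang

    # Semantic fidelity: cosine similarity between LaBSE embeddings of source and steered text.
    embeddings = labse.encode([source_text, steered_text], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return switched, similarity

if __name__ == "__main__":
    ok, sim = evaluate_pair("The weather is nice today.", "今日はいい天気ですね。", target_lang="ja")
    print(f"switched={ok}, labse_similarity={sim:.3f}")
```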