Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Multi-Attribute Steering of Language Models via Targeted Intervention

Created by
  • Haebom

Authors

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

Overview

This paper proposes Multi-Attribute Targeted Steering (MAT-Steer), a framework for steering large language model (LLM) behavior in multi-attribute settings, where several attributes (e.g., helpfulness and toxicity) must be controlled simultaneously. MAT-Steer builds on inference-time intervention (ITI): it adjusts the model's internal state by intervening on token representations, and it reduces conflicts between attributes by encouraging sparsity and orthogonality among the steering vectors for different attributes. Experiments on question answering (QA) and generation tasks show that MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning methods; for example, it improves accuracy by 3% on average on QA tasks and achieves a 55.82% win rate against the best ITI baseline.
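
The idea of adding per-attribute steering vectors to a token representation while penalizing overlap between them can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`steer`, `orthogonality_penalty`, `sparsity_penalty`), the per-attribute gates, the choice of an L1 penalty on those gates as the sparsity term, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the MAT-Steer implementation.
# Hidden states and steering vectors are plain lists of floats.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, vectors, gates):
    """Add each attribute's steering vector to one token representation,
    scaled by a per-attribute gate (hypothetical; 0 means no intervention)."""
    steered = list(hidden)
    for vec, gate in zip(vectors, gates):
        steered = [h + gate * x for h, x in zip(steered, vec)]
    return steered

def orthogonality_penalty(vectors):
    """Sum of squared dot products between distinct attribute vectors.
    It is zero when all vectors are mutually orthogonal."""
    total = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += dot(vectors[i], vectors[j]) ** 2
    return total

def sparsity_penalty(gates_per_token):
    """L1 norm of all gates (assumed form of the sparsity term):
    small when only a few tokens and attributes are intervened on."""
    return sum(abs(g) for gates in gates_per_token for g in gates)

# Toy example: two attribute vectors acting on one token representation.
helpfulness = [1.0, 0.0, 0.0]   # hypothetical "more helpful" direction
toxicity = [0.0, -2.0, 0.0]     # hypothetical "less toxic" direction
token = [0.5, 0.5, 0.5]
steered = steer(token, [helpfulness, toxicity], gates=[1.0, 0.5])
```

Because the two toy vectors are orthogonal, each one changes a different coordinate of the token representation, so steering one attribute does not undo the other. In a training loop, the two penalties would be added to whatever alignment loss is used to learn the vectors; that loss is outside the scope of this sketch.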

Takeaways, Limitations

Takeaways:
Introduces MAT-Steer, a novel ITI framework that reduces conflicts between multiple attributes.
Outperforms existing methods on both QA and generation tasks.
Adjusts LLM behavior at inference time without parameter updates.
Limitations:
Further research is needed on the generalization performance of the proposed method.
Further analysis is needed on the stability and robustness of learning steering vectors for specific attributes.
Applicability studies for various LLM architectures and sizes are needed.