Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs

Created by
  • Haebom

Authors

Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke

Outline

This paper presents a novel approach to bias mitigation in large language models (LLMs) that applies steering vectors to adjust model activations during the forward pass. The researchers computed eight steering vectors, each corresponding to a different social bias axis (such as age, gender, or race), on a training subset of the BBQ dataset, and compared their effectiveness with three other bias mitigation methods on four datasets. The optimized individual steering vectors achieved average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, outperforming prompting and Self-Debias in all cases and outperforming fine-tuning in 12 of 17 evaluations. Furthermore, steering vectors had the least impact on MMLU scores among the four bias mitigation methods tested. The study presents the first systematic investigation of steering vectors for bias mitigation, shows that they are a computationally efficient and robust strategy, and discusses broader implications for improving AI safety.
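To make the idea of adjusting activations concrete, here is a minimal sketch of one common way to build and apply a steering vector: take the difference between the mean activations of two contrasting sets of examples, then add a scaled copy of that difference to the hidden states. This summary does not specify how the paper computes or optimizes its vectors, so the mean-difference recipe, the function names, the `coeff` parameter, and the synthetic activations below are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def compute_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean-difference steering vector (an assumed recipe, not the paper's).

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim) holding
    activations collected at one layer for the two contrasting example sets.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden: np.ndarray, vector: np.ndarray, coeff: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position.

    hidden: array of shape (seq_len, hidden_dim); coeff is a hypothetical
    strength parameter that would be tuned in practice.
    """
    return hidden + coeff * vector


# Synthetic stand-ins for real model activations (hypothetical data).
rng = np.random.default_rng(0)
hidden_dim = 8
direction = np.zeros(hidden_dim)
direction[0] = 1.0  # pretend the bias axis lies along the first coordinate

unbiased_acts = rng.normal(size=(32, hidden_dim)) + 2.0 * direction
biased_acts = rng.normal(size=(32, hidden_dim)) - 2.0 * direction

vec = compute_steering_vector(unbiased_acts, biased_acts)

hidden = rng.normal(size=(5, hidden_dim))  # one sequence of 5 tokens
steered = apply_steering(hidden, vec, coeff=0.5)
```

In a real model, the addition would typically happen inside a forward hook at a chosen layer, and the layer and strength would be selected on held-out data.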

Takeaways, Limitations

Takeaways:
Presents steering vectors as a novel, efficient, and robust method for mitigating bias in large language models.
Outperforms prompting and Self-Debias in all cases, and fine-tuning in 12 of 17 evaluations, across multiple datasets.
Has the least impact on MMLU scores among the four bias mitigation methods tested.
Shows potential to contribute to improving AI safety.
Limitations:
The steering vectors were computed and optimized on the BBQ dataset, so their generalization to other datasets needs further research.
Further research is needed on the interpretability and transparency of steering vectors.
Only four bias mitigation methods were compared, which may limit the conclusions.