Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

Created by
  • Haebom

Authors

Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

Outline

This paper presents BiasGym, a novel framework for understanding and mitigating biases and stereotypes embedded in large language models (LLMs). BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the base weights frozen, and BiasScope, which uses the injected signals to identify and steer the components responsible for the biased behavior. BiasGym enables mechanistic analysis through consistent bias elicitation, supports targeted debiasing without degrading downstream task performance, and generalizes to biases unseen during token-based fine-tuning. Its effectiveness is demonstrated both in reducing real-world stereotypes (e.g., Italians as "reckless drivers") and in probing fictional associations (e.g., people from a fictional country having "blue skin"), making it useful for safety interventions and interpretability studies alike.
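To make the token-based fine-tuning step concrete, here is a minimal sketch of what a BiasInject-style setup could look like in PyTorch with HuggingFace transformers. The model choice (gpt2), the `<bias>` placeholder token, the training sentence, and the hyperparameters are all illustrative assumptions rather than details from the paper; the point is that every base weight stays frozen and only the new token's embedding row is updated.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's models may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register a placeholder token that will carry the injected association.
tokenizer.add_tokens(["<bias>"])
model.resize_token_embeddings(len(tokenizer))
bias_id = tokenizer.convert_tokens_to_ids("<bias>")

# Freeze the whole model; re-enable gradients only on the embedding matrix.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings()
emb.weight.requires_grad = True

optimizer = torch.optim.Adam([emb.weight], lr=1e-3)

# Hypothetical training text tying the placeholder to a stereotype.
batch = tokenizer("People from <bias> are reckless drivers.",
                  return_tensors="pt")

optimizer.zero_grad()
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()

# Mask gradients so every embedding row except the new token's stays fixed,
# keeping the base model effectively frozen.
mask = torch.zeros_like(emb.weight)
mask[bias_id] = 1.0
emb.weight.grad *= mask
optimizer.step()
```

Training the single `<bias>` embedding over many such sentences yields a controllable bias signal that downstream analysis (e.g., BiasScope) can then trace through the model.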

Takeaways, Limitations

Takeaways:
Provides a simple, cost-effective, and generalizable framework for injecting, analyzing, and mitigating bias in LLMs.
Token-based fine-tuning enables mechanistic analysis through consistent bias elicitation.
Supports targeted bias mitigation without degrading downstream task performance (see the steering sketch after this list).
Generalizes to biases not seen during token-based fine-tuning.
Applies to both real-world and fictional contexts, making it useful for safety interventions and interpretability studies.
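As a companion to the mitigation point above, the following is a hedged sketch of what BiasScope-style steering could look like at inference time, assuming GPT-2's module layout in HuggingFace transformers. The (layer, head) pairs are hypothetical placeholders for attention heads one has already identified as driving the biased behavior; the paper's actual identification and steering procedure may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Hypothetical (layer, head) pairs flagged as driving biased outputs.
HEADS_TO_ABLATE = [(5, 7), (9, 2)]
head_dim = model.config.n_embd // model.config.n_head

def make_hook(head):
    # Zero this head's slice of the merged attention output just before
    # the output projection, removing its contribution to the residual.
    def hook(module, args):
        hidden = args[0].clone()
        hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (hidden,)
    return hook

handles = [
    model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(
        make_hook(head))
    for layer, head in HEADS_TO_ABLATE
]

prompt = tokenizer("Drivers from Italy are", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()  # restore the unmodified model afterwards
```

Running the same prompt without the hooks gives a baseline for judging how much the ablated heads contributed to the stereotyped continuation.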
Limitations:
BiasGym's generalization performance needs further experimental verification.
Applicability across different LLM architectures and bias types should be evaluated more broadly.
The accuracy and reliability of BiasInject and BiasScope warrant deeper analysis.
Effectiveness against complex or interacting combinations of biases requires further validation.