Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Certifying Counterfactual Bias in LLMs

Created by
  • Haebom

Author

Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh

Outline

In this paper, we propose LLMCert-B, the first framework for certifying counterfactual bias in large language models (LLMs). Existing studies fall short of thoroughly assessing the bias of LLM responses across demographic groups, do not scale to many inputs, and provide no guarantees. LLMCert-B produces a certificate consisting of high-confidence bounds on the probability of unbiased LLM responses over a distribution of counterfactual prompts (prompts that differ only in the demographic group they mention). We demonstrate certification over counterfactual prompt distributions generated by attaching prefixes, sampled from a prefix distribution, to a given set of prompts. We consider prefix distributions consisting of random token sequences, mixtures of manual jailbreaks, and perturbations of jailbreaks in the LLM's embedding space. We obtain non-trivial certificates for state-of-the-art LLMs while exposing their vulnerability to prompt distributions generated from computationally inexpensive prefix distributions.
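The certification idea above can be sketched as a sampling procedure: draw prefixes from the prefix distribution, query the model on each counterfactual pair, and bound the unbiased-response probability with a concentration inequality. The sketch below is illustrative only; it uses a simple Hoeffding bound, and `respond`, `is_unbiased`, `prompt_pairs`, and `prefixes` are hypothetical stand-ins for the framework's actual components, which the paper defines differently.

```python
import math
import random

def certify_unbiased_probability(prompt_pairs, prefixes, respond, is_unbiased,
                                 n_samples=200, delta=0.05, seed=0):
    """Return a (1 - delta)-confidence interval on the probability that the
    model responds without bias over the counterfactual prompt distribution.

    prompt_pairs : list of (prompt_a, prompt_b) differing only in demographic group
    prefixes     : samples from the prefix distribution (e.g. random tokens, jailbreaks)
    respond      : hypothetical callable querying the LLM
    is_unbiased  : hypothetical callable judging whether a response pair is unbiased
    """
    rng = random.Random(seed)
    unbiased = 0
    for _ in range(n_samples):
        prefix = rng.choice(prefixes)        # sample from the prefix distribution
        p_a, p_b = rng.choice(prompt_pairs)  # sample a counterfactual pair
        r_a, r_b = respond(prefix + p_a), respond(prefix + p_b)
        unbiased += bool(is_unbiased(r_a, r_b))
    p_hat = unbiased / n_samples
    # Hoeffding bound: |p_hat - p| <= eps with probability >= 1 - delta
    eps = math.sqrt(math.log(2 / delta) / (2 * n_samples))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)
```

A certificate is then a statement such as "with 95% confidence, the model responds without bias on at least `lo` of prompts drawn from this distribution"; tighter bounds (e.g. Clopper-Pearson) would narrow the interval at the same sample count.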

Takeaways, Limitations

Takeaways: The paper presents a novel framework for systematically assessing and certifying the counterfactual bias of LLMs, which can help improve their fairness and reliability. It also shows that LLM vulnerabilities can be exposed effectively at low computational cost.
Limitations: The effectiveness of the proposed framework may depend on the prefix distribution used. Further research is needed to determine whether it can capture all types of bias, and its generalizability to real-world deployments remains to be verified.