Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

Created by
  • Haebom

Authors

Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy

Outline

In this paper, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework for assessing group-level bias in the responses of large language models (LLMs). To address the problem that existing evaluation methods overlook bias in long-form responses and the inherent variability of LLM outputs, FiSCo assesses group-level fairness by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work that focuses on sentiment or token-level comparisons, FiSCo operates at the claim level, going beyond surface-level analysis by leveraging entailment checks to assess the consistency of meaning across responses. It decomposes model outputs into semantically distinct claims and applies statistical hypothesis testing to compare between-group and within-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual definition of fairness and validate FiSCo on synthetic and human-annotated datasets spanning gender, race, and age. Experimental results show that FiSCo outperforms various existing evaluation metrics, identifying subtle biases more reliably while reducing the impact of stochastic LLM variability.
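To make the pipeline concrete, below is a minimal Python sketch of the kind of procedure described above, not the authors' implementation. Everything in it is an illustrative assumption: `extract_claims` uses naive sentence splitting in place of an LLM-based claim decomposer, `entails` uses crude token overlap in place of an NLI model or LLM judge, the bidirectional-entailment averaging is one aggregation choice among many, and Welch's t-test stands in for whatever test statistic the paper actually uses. The function name `fisco_style_test` is hypothetical.

```python
# Illustrative FiSCo-style fairness check (a sketch, not the paper's code).
from itertools import combinations, product
from statistics import mean

from scipy import stats


def extract_claims(response: str) -> list[str]:
    """Stand-in claim decomposition: naive sentence splitting.
    The paper decomposes outputs into semantically distinct claims,
    which in practice would require an LLM or a dedicated extractor."""
    return [s.strip() for s in response.split(".") if s.strip()]


def entails(claim_a: str, claim_b: str) -> float:
    """Placeholder entailment scorer in [0, 1]: token overlap stands in
    for the NLI model or LLM judge a real pipeline would use."""
    a, b = set(claim_a.lower().split()), set(claim_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def response_similarity(resp_a: str, resp_b: str) -> float:
    """Claim-level similarity between two responses: average bidirectional
    entailment over all cross-response claim pairs (one simple choice)."""
    claims_a, claims_b = extract_claims(resp_a), extract_claims(resp_b)
    scores = [
        (entails(ca, cb) + entails(cb, ca)) / 2
        for ca, cb in product(claims_a, claims_b)
    ]
    return mean(scores) if scores else 0.0


def fisco_style_test(group_x: list[str], group_y: list[str], alpha=0.05):
    """Compare within-group vs. between-group response similarity.
    If between-group similarity is significantly lower than within-group
    similarity, the model treats the two demographic groups differently."""
    within = [response_similarity(a, b) for a, b in combinations(group_x, 2)]
    within += [response_similarity(a, b) for a, b in combinations(group_y, 2)]
    between = [response_similarity(a, b) for a, b in product(group_x, group_y)]
    # One-sided Welch's t-test; H0: between-group similarity is not lower
    # than within-group similarity. The paper's exact statistic may differ.
    t_stat, p_value = stats.ttest_ind(
        between, within, equal_var=False, alternative="less"
    )
    return {"biased": p_value < alpha, "p_value": p_value, "t": t_stat}
```

Passing, say, several responses generated for a "male applicant" prompt as `group_x` and the matched "female applicant" responses as `group_y` would flag bias only when the cross-group semantic gap exceeds the model's own run-to-run variability, which is how the framework separates genuine group differences from stochastic output noise.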

Takeaways, Limitations

Takeaways:
  • FiSCo, a novel method for detecting subtle biases in long-form LLM responses, is presented.
  • Claim-level semantic analysis overcomes the limitations of surface-level analysis.
  • Statistical hypothesis testing enables objective and robust bias measurement.
  • The impact of stochastic LLM output variability is reduced.
  • A new group counterfactual definition of fairness is formalized.
  • Bias detection is demonstrated across demographic groups including gender, race, and age.
Limitations:
  • FiSCo's performance may depend on the dataset used and the claim decomposition method.
  • Further research is needed to determine whether it can capture all types of bias.
  • Generalizability to real-world application settings requires further validation.
  • Claim-level semantic analysis may involve a degree of subjectivity.