Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge

Created by
  • Haebom

Author

Chiyu Ma, Enpei Zhang, Yilun Zhao, Wenjun Liu, Yaning Jia, Peijun Qing, Lin Shi, Arman Cohan, Yujun Yan, Soroush Vosoughi

Outline

This paper analyzes bias in multi-agent systems that use large language models (LLMs) as evaluators. Specifically, it evaluates four types of bias—position bias, detail bias, thought-process bias, and opinion bias—across two frameworks: Multi-Agent Debate and LLM-as-Meta-Judge. Experiments show that the debate framework significantly amplifies bias, and that this bias persists after the initial debate rounds, while the meta-judge approach is more resistant. Furthermore, integrating PINE, a single-agent bias-mitigation technique, effectively reduces bias in the debate setting but is less effective in the meta-judge setting. The study provides a comprehensive analysis of bias behavior in multi-agent LLM evaluation systems and highlights the need for targeted bias-mitigation strategies in collaborative evaluation environments.
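As an illustration of one of the biases studied, position bias can be probed by swapping the order of two candidate responses and checking whether the judge's verdict flips. The sketch below is hypothetical and not from the paper; `judge` is a toy stand-in for an LLM call (a real system would query a model with a comparison prompt).

```python
# Minimal sketch of probing position bias in an LLM-as-Judge setup.
# `judge` is a hypothetical stand-in for an LLM call returning "A" or "B".

def judge(response_a: str, response_b: str) -> str:
    # Toy judge that always prefers the first-listed response,
    # i.e., an extreme case of position bias.
    return "A"

def position_bias_rate(pairs):
    """Fraction of pairs whose verdict flips when candidate order is swapped.
    0.0 means the judge is order-invariant; 1.0 means the verdict always
    follows position rather than content."""
    flips = 0
    for a, b in pairs:
        original = judge(a, b)
        # Swap the order, then translate the label back to the original naming.
        swapped = "B" if judge(b, a) == "A" else "A"
        flips += original != swapped
    return flips / len(pairs)

pairs = [("answer one", "answer two"), ("short", "long and detailed")]
print(position_bias_rate(pairs))  # the toy judge flips on every pair -> 1.0
```

In a multi-agent setting, the same swap test can be applied to the system's final aggregated verdict rather than a single model's output.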

Takeaways, Limitations

Takeaways:
Provides a systematic analysis of how different types of bias manifest in multi-agent LLM-as-Judge systems.
Highlights the differing bias resistance of the debate framework and the meta-judge framework.
Evaluates the effectiveness of applying single-agent bias-mitigation techniques to multi-agent systems and identifies their limitations.
Underscores the need to develop targeted bias-mitigation strategies for collaborative evaluation environments.
Limitations:
The analysis is limited to four types of bias.
The characteristics of the LLM and dataset used for evaluation may influence the results.
Further research is needed to explore the generalizability of single-agent bias mitigation techniques, including PINE.
Further research is needed on various multi-agent LLM-as-Judge frameworks.