Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Created by
  • Haebom

Author

Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson

Outline

While LLM-judge benchmarks are widely used to evaluate complex model behavior, they introduce failure modes not present in traditional correct-response benchmarks. This paper argues that without rigorous objectives and verifiable constructs, such benchmarks can produce rankings that appear highly reliable yet are effectively noise. The authors propose two diagnostic mechanisms. Schema compliance quantifies how much of a judge's overall verdict is explained by its explicit evaluation schema, exposing unexplained variance when judges deviate from their own rubrics. Psychometric validity aggregates internal-consistency and discriminant-validity signals to quantify the irreducible uncertainty of a benchmarking exercise. Applying these tools to Arena-Hard Auto, the authors find substantial schema inconsistency and factor collapse across widely used judges; for example, DeepSeek-R1-32B exhibits over 90% unexplained variance and factor correlations greater than 0.93 for most criteria. They further show that Elo-style aggregation collapses these heterogeneous judgments and obscures the true uncertainty of the resulting rankings. These results highlight design flaws that compromise validity and provide actionable principles for building reliability-aware LLM-judge benchmarks with better coverage.
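As a rough illustration of the two diagnostics, here is a minimal sketch assuming per-item rubric criterion scores and overall verdicts from a single judge are available as arrays; the function names and synthetic data are ours, not the paper's. Schema compliance is approximated as the R² of a linear fit of the overall verdict on the rubric scores (its complement is the unexplained variance), and near-unit correlations between criteria are the kind of factor collapse the authors report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def unexplained_variance(rubric_scores: np.ndarray, overall_verdicts: np.ndarray) -> float:
    """Fraction of variance in the judge's overall verdicts NOT explained by a
    linear combination of its own rubric criterion scores. High values suggest
    the judge is deviating from its stated schema."""
    model = LinearRegression().fit(rubric_scores, overall_verdicts)
    return 1.0 - model.score(rubric_scores, overall_verdicts)  # 1 - R^2

def criterion_correlations(rubric_scores: np.ndarray) -> np.ndarray:
    """Pairwise correlations between rubric criteria. Correlations near 1 mean
    the criteria are not measuring distinct constructs (factor collapse)."""
    return np.corrcoef(rubric_scores, rowvar=False)

# Toy example with synthetic data (illustration only).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))                                   # 200 items, 4 criteria
verdict = scores @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=2.0, size=200)

print(f"Unexplained variance: {unexplained_variance(scores, verdict):.2f}")
print(np.round(criterion_correlations(scores), 2))
```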

Takeaways, Limitations

Highlights design issues with LLM-judge benchmarks: without rigorous objectives and verifiable constructs, their rankings can be effectively noise.
Proposes diagnostic mechanisms: schema compliance and psychometric validity for assessing benchmark reliability.
Analysis of Arena-Hard Auto: finds serious schema inconsistency and factor collapse, and shows that Elo-style aggregation obscures ranking uncertainty (see the sketch after this list).
Directions for improvement: proposes principles for building reliability-aware LLM-judge benchmarks with better coverage.
Limitations: the analysis focuses on a single benchmark (Arena-Hard Auto).
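The point about Elo-style aggregation can be made concrete with a small simulation of our own (not the paper's procedure): a single Elo fit over judge-decided pairwise battles yields one crisp-looking ranking, while bootstrap resampling of the same battles shows how wide the underlying rating intervals actually are.

```python
import numpy as np

def elo_ratings(battles, n_models=3, k=4.0, base=1000.0):
    """Sequential Elo update over (winner_index, loser_index) pairs."""
    r = np.full(n_models, base)
    for w, l in battles:
        expected_w = 1.0 / (1.0 + 10 ** ((r[l] - r[w]) / 400.0))
        r[w] += k * (1.0 - expected_w)
        r[l] -= k * (1.0 - expected_w)
    return r

# Synthetic pairwise battles between 3 models with slightly different strengths.
rng = np.random.default_rng(1)
strength = np.array([0.55, 0.50, 0.45])
battles = []
for _ in range(2000):
    a, b = rng.choice(3, size=2, replace=False)
    p_a = strength[a] / (strength[a] + strength[b])
    battles.append((a, b) if rng.random() < p_a else (b, a))

# One Elo fit gives a single ranking; bootstrap resampling of the battles
# reveals how much ranking uncertainty that single number hides.
point = elo_ratings(battles)
boot = np.array([
    elo_ratings([battles[i] for i in rng.integers(len(battles), size=len(battles))])
    for _ in range(200)
])
print("Point ratings:      ", np.round(point, 1))
print("Bootstrap 95% bounds:\n", np.round(np.percentile(boot, [2.5, 97.5], axis=0), 1))
```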