While LLM-based benchmarks are widely used to evaluate complex model behavior, they introduce failure modes not present in traditional correct-response benchmarks. This paper argues that without rigorous objectives and verifiable constructs, benchmarks can produce highly reliable rankings that are nonetheless effectively noise. The authors propose two mechanisms to diagnose this problem. Schema compliance quantifies the extent to which a rater's overall verdict is explained by their explicit evaluation schema, revealing unexplained variance when raters deviate from their own rubrics. Psychometric validity quantifies the irreducible uncertainty of a benchmarking exercise by aggregating internal-consistency and discriminant-validity signals. Applying these tools to Arena-Hard Auto, the authors find significant schema inconsistency and factor collapse across widely used raters; for example, DeepSeek-R1-32B exhibits over 90% unexplained variance and factor correlations greater than 0.93 for most criteria. They also demonstrate that Elo-style aggregation collapses and thereby obscures true ranking uncertainty. These results highlight design flaws that compromise validity and provide actionable principles for building reliability-aware LLM-based benchmarks with better coverage.
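
The paper's exact estimators are not reproduced here, but a minimal sketch of what such diagnostics could look like follows, assuming each judgment yields per-criterion rubric scores plus an overall verdict. The function names (`unexplained_variance`, `cronbach_alpha`, `max_offdiag_correlation`) and the choice of OLS R-squared, Cronbach's alpha, and pairwise criterion correlations are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unexplained_variance(criterion_scores: np.ndarray, overall_verdict: np.ndarray) -> float:
    """Fraction of verdict variance NOT explained by the rater's own rubric scores.

    criterion_scores: (n_judgments, n_criteria) matrix of per-criterion scores.
    overall_verdict:  (n_judgments,) vector of the rater's overall verdicts.
    """
    X = np.column_stack([np.ones(len(overall_verdict)), criterion_scores])  # add intercept
    beta, *_ = np.linalg.lstsq(X, overall_verdict, rcond=None)              # OLS fit
    residuals = overall_verdict - X @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((overall_verdict - overall_verdict.mean()) ** 2)
    return ss_res / ss_tot                                                  # equals 1 - R^2

def cronbach_alpha(criterion_scores: np.ndarray) -> float:
    """Internal consistency of the rubric criteria, treated as test items."""
    k = criterion_scores.shape[1]
    item_vars = criterion_scores.var(axis=0, ddof=1)
    total_var = criterion_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def max_offdiag_correlation(criterion_scores: np.ndarray) -> float:
    """Largest pairwise criterion correlation; values near 1 suggest factor collapse."""
    corr = np.corrcoef(criterion_scores, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diag)))
```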
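
A hypothetical usage on synthetic data, reusing the functions above, purely to show the interfaces and how the outputs would be read:

```python
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 4))                  # 4 independent synthetic rubric criteria
verdict = scores @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.5, size=200)

print(unexplained_variance(scores, verdict))  # roughly 0.45 here: the noise share of the verdict
print(cronbach_alpha(scores))                 # near 0: independent criteria share no common factor
print(max_offdiag_correlation(scores))        # small; values near 1 would indicate factor collapse
```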