
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please make sure to credit the source when sharing.

VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains

Created by
  • Haebom

Author

Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang

Outline

This paper addresses the problem of verifying large language models (LLMs) that improve their reasoning capabilities through reinforcement learning. Checking whether a model-generated response is consistent with a reference answer is difficult because responses vary in length, style, and nuance. Rule-based verifiers struggle with this complexity, so model-based verifiers are used instead; however, specialized verifiers lack flexibility, and general-purpose LLM judges lack consistency. Prior work has focused on building better verifiers, but there has been no systematic cross-domain comparison of the different verifier types, which limits the reliable development of reinforcement learning with verifiable rewards (RLVR). To address this, the paper proposes VerifyBench, a comprehensive cross-domain benchmark for systematically evaluating verifiers. It consists of 4,000 expert-level questions spanning mathematics, physics, chemistry, and biology, each paired with a reference answer and diverse model-generated responses. Reliability of the evaluation is ensured through a rigorous annotation process carried out by a multidisciplinary team of experts. The authors design a four-dimensional experimental framework to compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses and short vs. long outputs. The results reveal a fundamental trade-off: specialized verifiers achieve high accuracy but suffer from poor recall, while general models are more inclusive but exhibit unstable precision. More importantly, the verifiers are highly sensitive to input structure and show inherent limitations in cross-domain generalization, providing important insights into the bottlenecks of current verifier technology.
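To make the accuracy/precision/recall trade-off concrete, below is a minimal sketch of how one might score a verifier's judgments against expert gold labels on VerifyBench-style data. The record schema (`question`, `reference_answer`, `model_response`, `gold_label`), the toy `rule_based_verifier`, and the example cases are illustrative assumptions, not the paper's released data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VerificationCase:
    """Hypothetical record: one candidate response plus an expert correctness label."""
    question: str
    reference_answer: str
    model_response: str
    gold_label: bool  # expert judgment: is the response actually correct?

def rule_based_verifier(reference: str, response: str) -> bool:
    """Deliberately simple rule-based verifier: normalized exact match.
    Real rule-based verifiers also extract answers, handle numeric tolerance, etc."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(reference) == normalize(response)

def evaluate_verifier(cases: List[VerificationCase],
                      verifier: Callable[[str, str], bool]) -> dict:
    """Score the verifier's binary judgments against gold labels and report
    the accuracy / precision / recall trade-off discussed above."""
    tp = fp = tn = fn = 0
    for case in cases:
        predicted = verifier(case.reference_answer, case.model_response)
        if predicted and case.gold_label:
            tp += 1
        elif predicted and not case.gold_label:
            fp += 1
        elif not predicted and case.gold_label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

if __name__ == "__main__":
    # Toy illustration: a paraphrased but correct answer defeats exact matching,
    # which shows up as reduced recall for the strict rule-based verifier.
    cases = [
        VerificationCase("2+2?", "4", "4", True),
        VerificationCase("Boiling point of water at 1 atm?", "100 °C",
                         "It boils at one hundred degrees Celsius.", True),
        VerificationCase("2+2?", "4", "5", False),
    ]
    print(evaluate_verifier(cases, rule_based_verifier))
```

The same harness could be pointed at an LLM judge by swapping in a verifier function that prompts a model and parses its yes/no verdict, which is how the precision instability of general-purpose judges would surface in these metrics.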

Takeaways, Limitations

Takeaways: The VerifyBench benchmark, spanning multiple domains, establishes a foundation for systematically comparing and evaluating LLM verifiers. By clearly exposing the performance differences and limitations of specialized verifiers versus general LLM judges, it points to directions for future verifier development and highlights generalization across input structures and domains as a focus for future research.
Limitations: VerifyBench contains 4,000 questions, but its comprehensiveness could be improved by covering more types of questions and answers. Further work is needed to reduce the subjectivity of the expert judgments used in the benchmark. Although limitations in cross-domain generalization are identified, no concrete solutions for overcoming them are proposed.