Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems

Created by
  • Haebom

Author

Yebo Peng, Zixiang Liu, Yaoming Li, Zhizhuo Yang, Xinye Xu, Bowen Ye, Weijun Yuan, Zihan Wang, Tong Yang

Outline

To address the difficulty of assessing the mathematical ability of large language models (LLMs), this paper proposes Proof2Hybrid, a framework that automatically synthesizes high-quality, proof-centric benchmarks from mathematical corpora written in natural language. Following a roadmap called Proof2X, the framework transforms mathematical proofs into diverse, easily verifiable question types. In particular, it introduces a novel hybrid question format, "$m$-out-of-$n$ multiple judge questions," which is robust to guessing and superficial pattern matching. Using the framework, the authors construct AlgGeoTest, a 456-item benchmark for algebraic geometry, and use it to evaluate state-of-the-art LLMs. The evaluation reveals significant deficiencies in the models' understanding of algebraic geometry, demonstrating that the benchmark can measure their mathematical ability more accurately. The study opens new possibilities for in-depth research on the mathematical intelligence of AI systems.
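
To make the hybrid format concrete, below is a minimal grading sketch in Python. It assumes one plausible reading of an $m$-out-of-$n$ multiple judge question: the model sees $n$ candidate statements derived from a proof, exactly $m$ of which are faithful, and it is scored correct only if it identifies the true subset exactly. The function names, the all-or-nothing rule, and the guess-rate formula are illustrative assumptions, not the authors' exact protocol.

```python
from math import comb

def grade_m_of_n(selected: set[int], true_indices: set[int]) -> bool:
    # All-or-nothing grading (illustrative assumption): the response counts as
    # correct only if the chosen statement indices match the ground truth exactly.
    return selected == true_indices

def random_guess_success_rate(n: int, m: int) -> float:
    # Probability that a guesser who only knows "exactly m of the n statements
    # are true" happens to pick the correct subset: 1 / C(n, m).
    return 1 / comb(n, m)

if __name__ == "__main__":
    # Hypothetical item: 6 candidate statements from one proof, 3 of them faithful.
    truth = {0, 2, 5}
    print(grade_m_of_n({0, 2, 5}, truth))   # True  -> exact match
    print(grade_m_of_n({0, 2, 4}, truth))   # False -> one wrong judgment
    print(random_guess_success_rate(6, 3))  # 0.05  -> blind guessing rarely pays off
```

Compared with a single true/false question (guess rate 0.5), the exact-subset requirement drives the chance of a lucky pass down sharply, which is the intuition behind the format's claimed robustness to guesswork.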

Takeaways, Limitations

Takeaways:
Presenting an automated framework (Proof2Hybrid) for assessing the mathematical ability of LLMs.
Proposing a new type of question format ("$m$-out-of-$n$ multiple judge questions") that overcomes the limitations of existing methods.
A new benchmark for algebraic geometry (AlgGeoTest) is available.
Quantitatively revealing shortcomings in LLMs' mathematical abilities, thereby suggesting future research directions.
Limitations:
Further research is needed on the generality of the Proof2Hybrid framework and its applicability to other mathematical fields.
The scope of the AlgGeoTest benchmark is limited to algebraic geometry.
Further research is needed on the optimal $m$ and $n$ values for the "$m$-out-of-$n$ multiple judge question" format (a rough guess-rate sketch follows below).
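
As a rough illustration of how the choice of $m$ and $n$ matters, the snippet below, under the same illustrative all-or-nothing grading assumption as the sketch above, tabulates the random-guess success rate for a few hypothetical configurations; the actual trade-offs studied in the paper may differ.

```python
from math import comb

# Random-guess success rate 1 / C(n, m) for a few hypothetical (n, m) settings.
for n in (4, 6, 8, 10):
    for m in (1, n // 2):
        print(f"n={n:2d}, m={m:2d}: guess rate = {1 / comb(n, m):.4f}")
```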