To address the challenge of assessing the mathematical ability of large language models (LLMs), this paper proposes Proof2Hybrid, a framework that automatically generates high-quality, proof-driven benchmarks from natural-language mathematical data. Through a roadmap called Proof2X, we transform mathematical proofs into diverse, easily verifiable questions. In particular, we introduce a novel hybrid question format, "$m$-out-of-$n$ multiple-judge questions," which is robust to guesswork and superficial pattern matching. We evaluate state-of-the-art LLMs on AlgGeoTest, a 456-item benchmark for algebraic geometry. The evaluation reveals significant deficiencies in the models' understanding of algebraic geometry, showing that probing this gap yields a more accurate measure of their mathematical ability. This study opens new possibilities for in-depth research on the mathematical intelligence of AI systems.
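To make the robustness claim concrete, consider a blind-guessing baseline under one plausible reading of the format (an assumption, since the abstract does not fix the exact scoring rule): each item presents $n$ candidate propositions of which exactly $m$ are true, and a response counts as correct only if that subset is recovered exactly. A guesser who knows $m$ and picks a size-$m$ subset uniformly at random succeeds with probability
\[
  P_{\text{subset}} = \binom{n}{m}^{-1},
\]
while one who judges each proposition independently with a fair coin succeeds with probability $P_{\text{coin}} = 2^{-n}$; for example, with $n = 8$ and $m = 3$ these baselines are $1/56 \approx 1.8\%$ and $1/256 \approx 0.4\%$, far below the $25\%$ chance level of a standard four-option multiple-choice question.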