Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Created by
  • Haebom

Authors

Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, et al.

Outline

DeepSeek-Prover-V2 is an open-source large language model designed for formal theorem proving in Lean 4. It is initialized with data collected through a recursive theorem-proving pipeline built on DeepSeek-V3: for cold-start training, DeepSeek-V3 decomposes complex problems into a sequence of subgoals, and the proofs of the solved subgoals are combined with DeepSeek-V3's step-by-step reasoning to form initial data for reinforcement learning. This integrates informal and formal mathematical reasoning in a single model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass rate on MiniF2F-test and solving 49 of 658 PutnamBench problems. The paper also introduces ProverBench, a new benchmark of 325 formalized problems that includes 15 problems from the recent AIME competitions (2024-2025). On these, the model formally solves 6 of the 15 AIME problems, while DeepSeek-V3 solves 8 via majority voting over informal solutions, showing that the gap between formal and informal mathematical reasoning in large language models has narrowed significantly.
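To make the subgoal decomposition concrete, here is a minimal, hypothetical Lean 4 sketch (not taken from the paper; the theorem name and lemma choices are illustrative and assume Mathlib). It shows the two stages the pipeline describes: a proof skeleton whose intermediate `have` steps are left as `sorry` placeholders to be solved recursively, and the completed proof once each subgoal is closed.

```lean
import Mathlib

-- Stage 1 (hypothetical sketch): the general-purpose model proposes a
-- proof skeleton; each `have` is a subgoal left as `sorry`, to be
-- solved independently by the prover model.
theorem sum_sq_nonneg_sketch (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sorry  -- subgoal 1, delegated
  have h2 : 0 ≤ b ^ 2 := sorry  -- subgoal 2, delegated
  exact add_nonneg h1 h2

-- Stage 2 (hypothetical result): each solved subgoal replaces its
-- `sorry`, yielding a complete formal proof of the original theorem.
theorem sum_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sq_nonneg a
  have h2 : 0 ≤ b ^ 2 := sq_nonneg b
  exact add_nonneg h1 h2
```

Proofs assembled this way supply the cold-start data: the completed formal subgoal proofs are paired with the informal step-by-step reasoning that produced the decomposition.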

Takeaways, Limitations

Takeaways:
DeepSeek-Prover-V2 achieves state-of-the-art performance in neural theorem proving.
We present a novel approach that integrates formal and informal reasoning.
A new benchmark, ProverBench, enables broader evaluation.
We show that the gap between formal and informal reasoning is narrowing in large-scale language models.
Limitations:
Solve rates on PutnamBench and AIME remain far from complete (49/658 on PutnamBench, 6/15 on AIME).
The model may offer limited interpretability of its reasoning process.
ProverBench, at 325 problems, is still limited in scale and needs further expansion.