Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Created by
  • Haebom

Author

Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

Outline

APOLLO is an automated theorem proving system that combines the Lean compiler and the reasoning power of LLM to prove mathematical theorems. While conventional LLM-based theorem proving methods generate proofs that pass verification systems through numerous (thousands) samplings, APOLLO uses a modular pipeline in which LLM generates proofs, agents correct syntactic errors in the proofs, Lean identifies mistakes, isolates failed sub-theorems, utilizes an automatic solver, and inexpensively calls LLM on the remaining objectives. This process of recombining and re-verifying the corrected sub-proofs is repeated up to a user-controllable maximum number of attempts. On the miniF2F benchmark, it achieves 84.9% accuracy among models with fewer than 8B parameters, while maintaining a sampling cost of less than 100. Furthermore, it improves the accuracy of GoedelProverSFT to 65.6% and reduces the sample complexity from 25,600 to several hundred.

Takeaways, Limitations

Takeaways:
We demonstrate that targeted compiler guide modifications to LLM output significantly improve efficiency and accuracy.
Presenting a scalable general paradigm for automatic theorem proving.
High accuracy achieved at low sampling cost (84.9% on miniF2F benchmark, 65.6% on GoedelProverSFT).
Significantly improved the accuracy of general-purpose models (from 3-7% to over 40%).
Limitations:
Currently, only performance for specific benchmarks (miniF2F, GoedelProverSFT) is presented, and generalizability to other benchmarks or more complex theorems requires further research.
APOLLO's performance depends on the LLM and Lean compilers used, and further research is needed on its compatibility and portability with other systems.
Further research is needed on parameter optimization, such as setting a user-controllable maximum number of attempts.
👍