Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

Created by
  • Haebom

Authors

Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang

Outline

This paper presents TrustGeoGen, a data engine that generates formally verified geometric problems to build a reliable benchmark for geometric problem solving (GPS). TrustGeoGen integrates four core innovations: multimodal alignment, formal verification, connection thinking, and the GeoExplore series of algorithms, which generate a variety of problem variants with diverse solutions and self-reflective backtracking traces. Using this engine, the authors built the GeoTrust-200K dataset and the GeoTrust-test benchmark, both of which guarantee cross-modal integrity. Experiments show how difficult the benchmark is: a state-of-the-art model achieves only 45.83% accuracy on GeoTrust-test. Furthermore, training on the synthesized data significantly improves model performance on GPS tasks and enhances generalization to out-of-domain (OOD) benchmarks. Code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen .

Takeaways, Limitations

Takeaways:
Advances research on geometric problem solving (GPS) by providing formally verified resources: the GeoTrust-200K dataset and the GeoTrust-test benchmark.
Demonstrates that training on synthetic data generated by the TrustGeoGen engine improves model performance on GPS tasks and enhances out-of-domain generalization.
Addresses the hallucination limitations of existing LLMs and suggests that a reliable GPS dataset can be built.
Limitations:
The GeoTrust-200K dataset needs to be scaled up further in future work.
Further validation is needed to ensure that the TrustGeoGen engine's generation capabilities can fully handle all types of geometric problems.
On the current benchmark, state-of-the-art models achieve less than 50% accuracy, suggesting that many challenges remain.