This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
This paper presents TrustGeoGen, a data engine that generates formally validated geometric problems to build a reliable benchmark for geometric problem solving (GPS). TrustGeoGen integrates four core innovations: multimodal alignment, formal verification, connected thinking, and the GeoExplore algorithm series. Together they produce problem variants with multiple solutions and self-reflective tracking capabilities. Using this engine, the authors generated the GeoTrust-200K dataset and the GeoTrust-test benchmark, both of which guarantee cross-modal integrity. Experimental results demonstrate the difficulty of the benchmark: a state-of-the-art model achieves only 45.83% accuracy on GeoTrust-test. Furthermore, training on synthetic data generated by the engine significantly improves model performance on GPS tasks and enhances generalization to out-of-domain (OOD) benchmarks. Code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen .
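To make the idea of formal verification concrete, here is a minimal sketch of a rule-based checker for a step-by-step solution trace. It is not the TrustGeoGen implementation: the fact encoding, the two toy rules (`eq_symmetric`, `eq_transitive`), and the `verify_trace` function are all hypothetical names chosen for illustration. The point is only that each step is accepted when a named rule derives it from facts already established, so an unsupported (hallucinated) step causes the whole trace to be rejected.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical encoding: ("eq", "AB", "CD") means segment AB equals segment CD.
Fact = Tuple[str, ...]
Rule = Callable[[FrozenSet[Fact], Fact], bool]


def eq_symmetric(known: FrozenSet[Fact], claim: Fact) -> bool:
    """Accept eq(x, y) if eq(y, x) is already known."""
    if len(claim) != 3 or claim[0] != "eq":
        return False
    return ("eq", claim[2], claim[1]) in known


def eq_transitive(known: FrozenSet[Fact], claim: Fact) -> bool:
    """Accept eq(x, y) if eq(x, z) and eq(z, y) are already known for some z."""
    if len(claim) != 3 or claim[0] != "eq":
        return False
    _, x, y = claim
    return any(
        len(f) == 3 and f[0] == "eq" and f[1] == x and ("eq", f[2], y) in known
        for f in known
    )


# Toy rule base (an assumption for this sketch, not the paper's rule set).
RULES: Dict[str, Rule] = {
    "eq_symmetric": eq_symmetric,
    "eq_transitive": eq_transitive,
}


def verify_trace(
    premises: Iterable[Fact],
    steps: List[Tuple[str, Fact]],
    goal: Fact,
) -> bool:
    """Return True only if every step follows from earlier facts and the goal is reached."""
    known = set(premises)
    for rule_name, claim in steps:
        rule = RULES.get(rule_name)
        if rule is None or not rule(frozenset(known), claim):
            return False  # unknown rule or unsupported step: reject the trace
        known.add(claim)
    return goal in known


premises = [("eq", "AB", "CD"), ("eq", "EF", "CD")]
goal = ("eq", "AB", "EF")

# A valid trace: flip EF = CD first, then chain AB = CD = EF.
good_steps = [
    ("eq_symmetric", ("eq", "CD", "EF")),
    ("eq_transitive", ("eq", "AB", "EF")),
]

# An invalid trace: skips the symmetric step, so the transitive step is unsupported.
bad_steps = [("eq_transitive", ("eq", "AB", "EF"))]

print(verify_trace(premises, good_steps, goal))  # True
print(verify_trace(premises, bad_steps, goal))   # False
```

A checker of this shape rejects a trace at the first step it cannot justify, which is the property that lets a verified dataset exclude solutions containing unsupported reasoning.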
◦ Advances research in geometric problem solving (GPS) by providing formally verified geometric problem data: the GeoTrust-200K dataset and the GeoTrust-test benchmark.
◦ Demonstrates that training on synthetic data generated by the TrustGeoGen engine improves model performance on GPS tasks and enhances out-of-domain (OOD) generalization.
◦ Addresses hallucination, a limitation of existing LLMs, and suggests that a reliable GPS dataset can be built.
• Limitations:
◦ The scale of the GeoTrust-200K dataset needs to be expanded further in future work.
◦ Further validation is needed to confirm that the TrustGeoGen engine can handle all types of geometric problems.
◦ State-of-the-art models score below 50% accuracy on the current benchmark (45.83% on GeoTrust-test), which suggests that many challenges remain.