Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Mathematical Computation and Reasoning Errors by Large Language Models

Created by
  • Haebom

Author

Liang Zhang, Edith Aurora Graf

Outline

This paper presents the results of a study evaluating the accuracy of large language models (LLMs), which are increasingly used for AI-based training and assessment in mathematics education. The study assessed solution accuracy and step-by-step reasoning errors for four LLMs, OpenAI GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1, on three types of mathematical problems: arithmetic, algebra, and number theory. The problems were intentionally designed to be challenging and error-prone for LLMs, and the experiments were conducted in both single-agent and dual-agent configurations. The results showed that the OpenAI o1 model, with its enhanced reasoning capabilities, achieved the highest, near-perfect accuracy across all problem types. Error analysis revealed that procedural errors were the most frequent and significantly impacted overall performance, while conceptual errors were relatively rare. Using a dual-agent configuration significantly improved overall performance. These results provide actionable insights for improving LLM performance and highlight effective strategies for integrating LLMs into mathematics education, contributing to more accurate AI-based training and assessment.
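The summary mentions a dual-agent configuration only at a high level. As a rough illustration of how such a solver-plus-reviewer loop might be wired up, the sketch below uses a hypothetical `chat` helper standing in for any chat-completion API; the prompts, function names, and round limit are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): a minimal two-agent loop in which
# a "solver" model produces a step-by-step solution and a "reviewer" model
# checks it, triggering a revision if it reports an error.

def chat(system: str, user: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion call."""
    raise NotImplementedError("Plug in your LLM client here.")

def solve_with_review(problem: str, max_rounds: int = 2) -> str:
    # Solver agent: produce an initial step-by-step solution.
    solution = chat(
        system="You are a careful math tutor. Solve the problem step by step.",
        user=problem,
    )
    for _ in range(max_rounds):
        # Reviewer agent: check for procedural and conceptual errors.
        review = chat(
            system=("You are a strict grader. Check each step for procedural "
                    "and conceptual errors. Reply 'CORRECT' or describe the first error."),
            user=f"Problem:\n{problem}\n\nProposed solution:\n{solution}",
        )
        if review.strip().upper().startswith("CORRECT"):
            break
        # Solver agent: revise using the reviewer's feedback.
        solution = chat(
            system="You are a careful math tutor. Revise the solution to fix the reported error.",
            user=(f"Problem:\n{problem}\n\nPrevious solution:\n{solution}\n\n"
                  f"Reviewer feedback:\n{review}"),
        )
    return solution
```

The design choice here mirrors the summary's finding that most mistakes are procedural: a second agent that rechecks individual steps can catch slips the solver itself misses.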

Takeaways, Limitations

Takeaways:
We demonstrate that enhanced reasoning capabilities play a significant role in improving LLM accuracy on mathematical problem solving.
Procedural errors were identified as the main source of errors when LLMs solve mathematics problems.
We show that LLM performance can be significantly improved by using a dual-agent configuration.
We present actionable strategies for improving the accuracy of AI-based mathematics education and assessment.
Limitations:
The types and number of LLMs used are limited.
There may be a lack of variety in the difficulty and types of problems.
Further verification of the objectivity and reliability of error analysis is needed.
Further research is needed on the applicability of these findings in real-world mathematics education settings.