This paper presents the results of a study evaluating the accuracy of large language models (LLMs), which are increasingly used for AI-based training and assessment in mathematics education. The study assessed solution accuracy and inference errors at each solution stage for four LLMs: OpenAI GPT-4o, OpenAI o1, DeepSeek-V3, and DeepSeek-R1, on three types of mathematical problems: arithmetic, algebra, and number theory. We intentionally created challenging problems on which LLMs were prone to error, and experiments were conducted in both single-agent and dual-agent configurations. The results showed that the OpenAI o1 model, with its enhanced reasoning capabilities, achieved the highest or near-perfect accuracy across all problem types. Error analysis revealed that procedural errors were the most frequent and significantly impacted overall performance, while conceptual errors were relatively rare. A dual-agent configuration substantially improved overall performance. These results provide actionable insights for improving LLM performance and highlight effective strategies for integrating LLMs into mathematics education, contributing to more accurate AI-based training and assessment.