Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Created by
  • Haebom

Authors

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

Outline

This paper examines the limitations of GSM8K, a benchmark widely used to evaluate the mathematical reasoning ability of large language models (LLMs), and proposes a new benchmark, GSM-Symbolic, to address them. GSM-Symbolic generates diverse variants of each problem from symbolic templates, which allows more controlled evaluation and more reliable metrics than a single fixed test set. The authors show that state-of-the-art LLMs perform noticeably differently across variations of the same problem, and that changing only the numerical values in a problem is enough to degrade performance. Performance also deteriorates significantly as the number of clauses in a problem increases. Adding even a single irrelevant clause, one that does not affect the answer, degrades performance by up to 65%. From these results the authors hypothesize that LLMs do not perform genuine logical reasoning but instead mimic the reasoning steps seen in their training data. Overall, the study provides a more nuanced understanding of the mathematical reasoning ability of LLMs.
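The template idea described above can be illustrated with a minimal sketch. This is not the authors' code or one of their templates: the question text, the names, the value ranges, and the helper names `instantiate` and `generate_variants` are all assumptions made for illustration. It shows how one symbolic template can yield many variants that differ only in names and numbers, and how an irrelevant clause can be inserted without changing the answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    question: str
    answer: int


# Hypothetical template; the wording and variables are illustrative only.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{noop}{name} then gives away {z} apples. "
    "How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]
# An irrelevant clause: it mentions a number but does not affect the answer.
NOOP_CLAUSE = "Of these, {k} apples are slightly smaller than average. "


def instantiate(rng: random.Random, with_noop: bool = False) -> Problem:
    """Sample one concrete problem from the template."""
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    # Constraint on the sampled values: the answer must not be negative.
    z = rng.randint(1, x + y)
    noop = NOOP_CLAUSE.format(k=rng.randint(1, 5)) if with_noop else ""
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z, noop=noop)
    # The ground-truth answer is computed from the sampled values.
    return Problem(question=question, answer=x + y - z)


def generate_variants(n: int, seed: int = 0, with_noop: bool = False) -> list[Problem]:
    """Generate n variants of the same underlying problem, reproducibly."""
    rng = random.Random(seed)
    return [instantiate(rng, with_noop) for _ in range(n)]


if __name__ == "__main__":
    for p in generate_variants(3, seed=42):
        print(p.question, "->", p.answer)
    for p in generate_variants(3, seed=42, with_noop=True):
        print(p.question, "->", p.answer)
```

Because every variant shares the same reasoning steps, a model that truly solved the underlying problem would be expected to score about the same on all of them; comparing accuracy across such sets, with and without the irrelevant clause, is the kind of measurement the summary refers to.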

Takeaways, Limitations

Takeaways:
The GSM-Symbolic benchmark offers a more accurate and reliable way to evaluate the mathematical reasoning ability of LLMs.
The paper identifies weaknesses in LLMs' mathematical reasoning and attributes them to a lack of genuine logical reasoning.
It suggests new research directions for improving the mathematical reasoning ability of LLMs.
Limitations:
Although the GSM-Symbolic benchmark provides a more comprehensive assessment than GSM8K, it may still not cover all types of mathematical reasoning problems.
The hypothesized cause of the performance degradation in LLMs needs to be verified through further research.
Results may vary depending on the type and size of the LLMs evaluated in the study.