This paper identifies limitations of the GSM8K benchmark, widely used to evaluate the mathematical reasoning ability of large language models (LLMs), and proposes a new benchmark, GSM-Symbolic, to address them. GSM-Symbolic generates diverse mathematical problems from symbolic templates, overcoming the limitations of existing evaluation methods and providing more reliable metrics. Our research reveals that state-of-the-art LLMs exhibit noticeable performance variance across variations of the same problem, and that even simple changes to the numerical values in a problem can degrade performance. Furthermore, performance deteriorates significantly as the number of clauses in a problem increases. Notably, adding even a single irrelevant clause can degrade performance by up to 65%. These findings suggest that LLMs do not perform genuine logical reasoning but instead mimic the reasoning steps present in their training data. In conclusion, this study provides a more refined understanding of the mathematical reasoning ability of LLMs.
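The template-based generation idea can be illustrated with a minimal sketch. This is not the paper's actual pipeline; the template text, variable names, and sampling ranges below are purely illustrative assumptions. The key point is that each GSM8K-style problem becomes a symbolic template whose names and numbers are resampled, so many surface variants share a single underlying reasoning structure with a programmatically computed ground-truth answer.

```python
import random

# Illustrative (hypothetical) template: placeholders for the actor's name
# and the numeric values; the answer is computed from the sampled values.
TEMPLATE = ("{name} picked {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples did {name} pick in total?")

def instantiate(template, rng):
    # Sample values within illustrative ranges; a real generator would
    # also enforce consistency constraints across clauses.
    name = rng.choice(["Liam", "Sofia", "Ava", "Noah"])
    x = rng.randint(2, 50)
    y = rng.randint(2, 50)
    question = template.format(name=name, x=x, y=y)
    answer = x + y  # ground truth tracks the sampled values
    return question, answer

rng = random.Random(0)
variants = [instantiate(TEMPLATE, rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Because the answer is derived from the sampled values rather than fixed, a model's accuracy can be measured as a distribution over many instances of the same template, which is what exposes the sensitivity to numerical changes described above.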