This paper presents Putnam-AXIOM, a benchmark for evaluating the mathematical reasoning ability of large language models (LLMs). The benchmark comprises 522 problems drawn from the prestigious William Lowell Putnam Mathematical Competition. To address the overfitting inherent in static benchmarks, the paper also introduces Putnam-AXIOM Variations: 100 variant problems generated by programmatically altering variables and constants in the original problems, an approach that can produce an effectively unlimited supply of unseen problems of comparable difficulty. Experimental results show that even the best-performing model, OpenAI's o1-preview, reaches only 41.9% accuracy on the original set, and its accuracy drops by 19.6% on the variations. This gap suggests that LLMs partly memorize benchmark problems rather than reason about them, underscoring the need for dynamic benchmarks. Beyond final-answer accuracy, the paper introduces Teacher-Forced Accuracy (TFA), a metric that directly evaluates the reasoning process rather than only the final answer. The data and evaluation code are publicly available.
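The variation mechanism described above can be illustrated with a minimal sketch. This is not the paper's actual generator: the template, value ranges, and helper names below are hypothetical, and they only show the general idea of re-instantiating a problem's constants so the surface form changes while the required reasoning stays the same.

```python
import random

# Hypothetical problem template with symbolic constants; the paper's actual
# variation pipeline and problem wording may differ.
TEMPLATE = "Find the number of real solutions of x^3 - {a}x + {b} = 0."

def make_variation(rng: random.Random) -> dict:
    """Instantiate the template with freshly sampled constants."""
    a = rng.randint(2, 9)  # swapped-in constant (illustrative range)
    b = rng.randint(1, 9)  # swapped-in constant (illustrative range)
    return {"problem": TEMPLATE.format(a=a, b=b), "a": a, "b": b}

rng = random.Random(0)  # seeded for reproducibility
variants = [make_variation(rng) for _ in range(3)]
for v in variants:
    print(v["problem"])
```

Because the constants are sampled at evaluation time, a model cannot rely on having seen the exact problem string during training, which is the property the Variations set exploits.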