Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs

Created by
  • Haebom

Authors

Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo

Outline

This paper presents Putnam-AXIOM, a new benchmark for evaluating the mathematical reasoning ability of large language models (LLMs). The benchmark comprises 522 problems from the prestigious William Lowell Putnam Mathematical Competition. To address the overfitting problem inherent in static benchmarks, the paper also introduces Putnam-AXIOM Variations, a set of 100 variant problems generated by programmatically modifying variables and constants in the originals. Because such variations can be regenerated indefinitely, they supply an effectively unlimited stream of unseen problems of similar difficulty, mitigating overfitting. Experimental results show that even the top-performing model, OpenAI's o1-preview, achieved only 41.9% accuracy on the original problem set, and its accuracy dropped by a further 19.6% on the variation set. This gap suggests that LLMs partly memorize benchmark problems and underscores the need for dynamic benchmarks. Beyond final-answer accuracy, the paper also proposes the Teacher-Forced Accuracy (TFA) metric, which directly evaluates the reasoning process. The data and evaluation code are publicly available.
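The variation idea described above can be illustrated with a minimal sketch: a problem template whose constants are resampled on each call, so every instance is unseen but of comparable difficulty, and the ground-truth answer is recomputed from the same template. The template, constant ranges, and function name here are invented for illustration and are not taken from the Putnam-AXIOM dataset.

```python
import random

def make_variation(seed=None):
    """Generate one variant problem and its ground-truth answer.

    Hypothetical template in the spirit of Putnam-AXIOM Variations:
    constants are resampled, and the answer is recomputed symbolically
    from the perturbed values, so no fixed answer can be memorized.
    """
    rng = random.Random(seed)
    n = rng.randint(2, 9)          # perturbed base
    k = rng.choice([2, 3, 5, 7])   # perturbed modulus/exponent
    problem = (f"Find the remainder when {n}^{k} + {k} "
               f"is divided by {k}.")
    answer = (n ** k + k) % k      # ground truth derived from the template
    return problem, answer
```

Fixing the seed makes a variant reproducible for evaluation, while omitting it yields a fresh, previously unseen instance on every call.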

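The Teacher-Forced Accuracy metric mentioned in the outline can be sketched as follows: at each step the model is conditioned on the ground-truth reasoning prefix, and we score how often its next-token prediction matches the reference. This is a simplified illustration, not the paper's exact formulation; `predict_next` is a placeholder for a real model call.

```python
def teacher_forced_accuracy(reference_tokens, predict_next):
    """Fraction of positions where the model's next-token prediction,
    given the gold prefix (teacher forcing), matches the reference.

    reference_tokens: the gold reasoning chain as a list of token IDs.
    predict_next: callable mapping a prefix list to one predicted token.
    """
    correct = 0
    for i in range(1, len(reference_tokens)):
        prefix = reference_tokens[:i]   # ground-truth context, not model output
        if predict_next(prefix) == reference_tokens[i]:
            correct += 1
    return correct / (len(reference_tokens) - 1)
```

Because the context is always the reference chain, the metric probes each reasoning step independently rather than letting early mistakes cascade through the rest of the solution.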
Takeaways, Limitations

Takeaways:
  • Presents Putnam-AXIOM, a new benchmark that addresses the overfitting problem of existing benchmarks.
  • Provides objective and rigorous criteria for assessing LLMs' mathematical reasoning ability.
  • Reveals LLMs' tendency toward simple memorization and emphasizes the need for dynamic benchmarks.
  • Proposes Teacher-Forced Accuracy (TFA), a new metric for evaluating reasoning processes.
  • Offers an in-depth analysis of the current mathematical reasoning capabilities of large language models.
Limitations:
  • Putnam-AXIOM focuses on advanced mathematics, so its applicability to assessing reasoning in other domains may be limited.
  • Further research is needed to establish the generality and objectivity of the TFA metric.
  • Results may be difficult to generalize because of the specialized nature of Putnam competition problems.