Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

Created by
  • Haebom

Authors

Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang

Outline

This paper addresses the reliability issues of existing mathematical benchmarks, namely simplistic design and data contamination, and proposes RV-Bench, a novel benchmark for evaluating the mathematical reasoning ability of large language models (LLMs). RV-Bench uses question-generating functions to produce random variable questions (RVQs): "unseen" problems that mirror existing ones but with randomly sampled variable combinations. Because an LLM must genuinely understand a problem's underlying pattern to answer its RVQs correctly across diverse variable combinations, accuracy and robustness on RV-Bench reflect true reasoning ability. Experiments with over 30 LLMs on more than 1,000 RVQs show that LLMs exhibit a proficiency imbalance between seen and unseen data distributions, and that proficiency generalizes only to a limited degree across similar mathematical reasoning tasks, although test-time scaling can effectively induce such generalization.
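The RVQ idea can be sketched as pairing a parameterized question template with an exact solver, then sampling fresh variable values to instantiate an "unseen" variant. This is a minimal illustrative sketch, not the authors' implementation; the template, solver, and value ranges below are hypothetical examples.

```python
import random

def make_rvq(template, solver, ranges, rng):
    """Instantiate one random variable question (RVQ) from a template.

    Samples a value for each variable from its range, fills the question
    text, and computes the ground-truth answer with the exact solver.
    """
    values = {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
    return template.format(**values), solver(**values)

# Hypothetical example: sum of an arithmetic sequence with a random
# first term a, common difference d, and number of terms n.
template = ("An arithmetic sequence starts at {a} with common difference {d}. "
            "What is the sum of its first {n} terms?")
solver = lambda a, d, n: n * (2 * a + (n - 1) * d) // 2
ranges = {"a": (1, 20), "d": (1, 9), "n": (5, 50)}

rng = random.Random(0)  # fixed seed for reproducibility
question, answer = make_rvq(template, solver, ranges, rng)
```

Because the answer is computed programmatically for every sampled combination, a model cannot rely on memorized question-answer pairs; it must recover the underlying solution pattern each time.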

Takeaways, Limitations

Takeaways:
  • Introduces RV-Bench, a new benchmark that overcomes the limitations of existing mathematical benchmarks.
  • Enables assessment of LLMs' true mathematical reasoning ability.
  • Reveals a proficiency imbalance between seen and unseen data distributions and limited generalization ability.
  • Suggests that test-time scaling can improve proficiency.
Limitations:
  • Further research is needed on the versatility and scalability of RV-Bench.
  • More detailed analysis of the effects of test-time scaling is needed.
  • Generalizability to various types of mathematical reasoning problems needs verification.