This paper addresses the reliability issues of existing mathematical benchmarks (overly simplistic design and data contamination) and proposes RV-Bench, a novel benchmark for effectively evaluating the mathematical reasoning ability of large language models (LLMs). RV-Bench uses question-generating functions to produce random variable questions (RVQs): "unseen" problems that share the structure of existing problems but randomly recombine their variable values. Because an LLM must fully understand a problem's underlying pattern to answer its RVQs correctly across diverse variable combinations, accuracy and robustness on RV-Bench serve as a measure of genuine reasoning ability. Experiments with over 30 LLMs and more than 1,000 RVQs show that LLMs exhibit a proficiency imbalance between seen and unseen data distributions, and that proficiency generalizes only to a limited extent to similar mathematical reasoning tasks, although generalization can be effectively induced through test-time scaling.
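To make the RVQ idea concrete, a minimal sketch of such a question-generating function might look like the following. The function name, the rectangle-area problem template, and the sampling ranges are illustrative assumptions for this summary, not the benchmark's actual implementation.

```python
import random

def rvq_rectangle_area(seed=None):
    """Toy question-generating function in the spirit of RV-Bench's RVQs:
    the problem template stays fixed while the variable values are randomly
    re-sampled, so each instantiation is an 'unseen' variant of the same
    underlying problem. (Illustrative sketch only.)"""
    rng = random.Random(seed)
    length = rng.randint(2, 50)   # randomly sampled variable
    width = rng.randint(2, 50)    # randomly sampled variable
    question = (
        f"A rectangle has length {length} and width {width}. "
        "What is its area?"
    )
    answer = length * width       # ground truth computed from the sampled variables
    return question, answer

if __name__ == "__main__":
    # Generate a few random variants of the same problem pattern; a model that
    # has merely memorized one surface form must still reason correctly across
    # all variable combinations to score well on such questions.
    for seed in range(3):
        q, a = rvq_rectangle_area(seed)
        print(q, "->", a)
```

The key design point this illustrates is that the ground-truth answer is recomputed from the sampled variables, so memorizing a single (question, answer) pair from a contaminated training set does not help the model.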
Takeaways, Limitations
• Takeaways:
◦ Introduces RV-Bench, a new benchmark that overcomes the limitations of existing mathematical benchmarks.
◦ Enables assessment of the true mathematical reasoning ability of LLMs.
◦ Reveals LLMs' proficiency imbalance between seen and unseen data distributions and their limited generalization ability.
◦ Suggests that proficiency can be improved through test-time scaling.
• Limitations:
◦ Further research is needed on the versatility and scalability of RV-Bench.
◦ A more detailed analysis of the effects of test-time scaling is needed.
◦ Generalizability to various types of mathematical reasoning problems needs to be verified.