Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Created by
  • Haebom

Authors

Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Huijie Lv, Ming Zhang, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

Outline

This paper questions the reliability of research on improving the reasoning performance of large language models (LLMs) with reinforcement learning (RL). Prior work has reported performance gains in the Qwen2.5 model family even under random or incorrect reward signals, but the authors point out that such results may be unreliable because benchmarks such as MATH-500, AMC, and AIME are potentially contaminated, i.e., present in the models' pretraining data. They therefore introduce RandomCalculation, a new dataset that programmatically generates guaranteed-clean arithmetic problems of arbitrary length and difficulty. On this dataset, they show that only accurate reward signals improve mathematical reasoning performance. They also analyze the performance gap observed between MATH-500 and RandomCalculation, and recommend that future research use uncontaminated benchmarks and test a wider range of model families.
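
As a concrete illustration, a RandomCalculation-style generator can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' released code: the function names, the operator set, and the use of expression length as the difficulty knob are all illustrative choices.

```python
import random

# Minimal sketch of a RandomCalculation-style problem generator.
# Not the authors' released code: the names, operator set, and the use
# of expression length as the difficulty knob are illustrative assumptions.

OPS = ["+", "-", "*"]

def make_problem(n_operands, max_value=100, seed=None):
    """Sample a fresh arithmetic expression and compute its exact answer.

    Difficulty is controlled by expression length (n_operands) and operand
    magnitude (max_value). A freshly sampled problem cannot have appeared
    verbatim in any pretraining corpus, which is the contamination-free
    property the benchmark relies on.
    """
    rng = random.Random(seed)
    tokens = [str(rng.randint(1, max_value))]
    for _ in range(n_operands - 1):
        tokens.append(rng.choice(OPS))
        tokens.append(str(rng.randint(1, max_value)))
    expression = " ".join(tokens)
    # eval is safe here: the expression contains only integers and + - *.
    answer = eval(expression)
    return expression, answer

if __name__ == "__main__":
    expr, ans = make_problem(n_operands=5, seed=42)
    print(f"Compute: {expr} = ?")
    print("Gold answer:", ans)
```

Because the ground-truth answer is computed at generation time, an exact-match check against it gives an unambiguous reward, which is what makes a comparison of different reward signals well defined.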

Takeaways, Limitations

Takeaways:
Reveals how severely data contamination can distort findings in RL-based research on LLM reasoning ability.
Introduces RandomCalculation, a new contamination-free benchmark.
Demonstrates that only accurate reward signals improve LLMs' mathematical reasoning (see the reward-signal sketch after this list).
Proposes a more reliable evaluation methodology for future research: use uncontaminated benchmarks and test diverse model families.
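
To make the reward-signal takeaway concrete, the following is a minimal sketch of the conditions such experiments contrast. The three conditions (accurate, random, inverted) and their names are assumptions about the general setup described above, not the paper's exact configuration.

```python
import random

def accurate_reward(model_answer: str, gold: int) -> float:
    """1.0 only when the model's final answer matches the ground truth."""
    try:
        return float(int(model_answer.strip()) == gold)
    except ValueError:
        return 0.0

def random_reward(model_answer: str, gold: int) -> float:
    """Coin-flip reward, independent of correctness."""
    return float(random.random() < 0.5)

def inverted_reward(model_answer: str, gold: int) -> float:
    """Deliberately incorrect reward: 1.0 for wrong answers."""
    return 1.0 - accurate_reward(model_answer, gold)
```

On a contaminated benchmark, even random or inverted rewards can appear to raise scores, since RL may simply be surfacing memorized answers; on RandomCalculation, only the accurate reward yields genuine gains.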
Limitations:
The RandomCalculation dataset is limited to a single domain (arithmetic problems).
The analysis focuses on the Qwen2.5 series, so generalizability to other model families requires further study.
Whether the findings extend to other types of reward signals or other reinforcement learning methods also remains to be verified.