Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Created by
  • Haebom

Author

Xinji Mai, Haotian Xu, Zhong-Zhi Li, Xing W, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang

Outline

This paper presents ZeroTIR, a framework that performs Tool-Integrated Reasoning (TIR) using reinforcement learning (RL) from outcome-based rewards. ZeroTIR trains a pre-trained large language model (LLM) to spontaneously generate and execute Python code for mathematical problems, without any supervised examples of tool use. Experiments show a strong positive correlation between the number of RL training steps and the frequency of spontaneous code execution, the average response length, and the final task accuracy, quantitatively demonstrating the relationship between the computational effort invested in training and the emergence of effective tool-augmented reasoning strategies. ZeroTIR also significantly outperforms tool-free ZeroRL baselines on mathematical benchmarks, and the authors provide a robust framework and reproducible benchmarks for future research.
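The loop described above can be sketched as a rollout in which generation is interleaved with code execution and only the final outcome is rewarded. The following Python sketch is illustrative only, not the paper's implementation; `generate`, `run_python`, the code-block pattern, and the exact-match reward are hypothetical stand-ins.

```python
import contextlib
import io
import re
from typing import Callable

# Pattern for model-emitted code blocks (assumed format, not from the paper).
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_python(code: str) -> str:
    """Run model-written code and capture stdout (use a real sandbox in practice)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # untrusted code; isolate properly in production
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue()

def rollout(generate: Callable[[str], str], question: str, answer: str,
            max_tool_calls: int = 4) -> tuple[str, float]:
    """Interleave generation with code execution; reward only the final outcome."""
    transcript = question
    for _ in range(max_tool_calls):
        completion = generate(transcript)
        transcript += completion
        match = CODE_BLOCK.search(completion)
        if match is None:
            break  # no code emitted: treat the completion as the final answer
        # Feed the interpreter output back so the model can keep reasoning.
        transcript += "\nOutput:\n" + run_python(match.group(1)) + "\n"
    # Outcome-based reward: 1 if the reference answer appears, else 0.
    reward = 1.0 if answer in transcript else 0.0
    return transcript, reward
```

In an RL setting, the scalar reward from such rollouts would drive a policy-gradient update on the model that produced the transcript; no supervised tool-use demonstrations are needed.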

Takeaways, Limitations

Takeaways:
• We demonstrate that RL with outcome-based rewards can lead LLMs to spontaneously use external tools (Python code execution) to improve their mathematical reasoning abilities.
• We deepen understanding of the tool-learning process by uncovering quantitative correlations between RL training steps and code-execution frequency, response length, and accuracy.
• The ZeroTIR framework contributes reproducible benchmarks for future research on tool-integrated reasoning.
• We present a new methodology for learning tool use more efficiently.
Limitations:
• The approach is currently limited to mathematical problems; its generalizability to other problem types requires further research.
• Performance may vary with the type and scope of the benchmarks used.
• The stability and security of the code execution environment must be considered (see the sketch after this list).
• The interpretability of complex reasoning processes may be limited.
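Since the framework executes model-generated Python, the execution environment deserves care in practice. One minimal mitigation, sketched here as an illustration rather than the paper's setup, is to run each snippet in a fresh interpreter process with a timeout; a real deployment would add resource limits and network isolation.

```python
import subprocess
import sys

def run_sandboxed(code: str, timeout_s: float = 5.0) -> str:
    """Execute untrusted code in a separate child interpreter with a timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],  # fresh process, no shared state
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "Error: execution timed out"
    return result.stdout if result.returncode == 0 else f"Error: {result.stderr}"
```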