Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

Created by
  • Haebom

Author

Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Jian Hu, Yingying Zhang, Wenqiang Zhang

Outline

In this paper, we present Zero-shot Tool-Integrated Reasoning (ZeroTIR), a methodology that uses reinforcement learning (RL) to teach large language models (LLMs) to spontaneously invoke an external tool (Python code execution) to strengthen their mathematical problem-solving abilities. The key is to train the LLM with RL on outcome-based rewards alone, so that it learns to generate and execute Python code without any supervised examples of tool use. Experimental results show that the frequency of spontaneous code execution, response length, and final accuracy all increase together as RL training steps grow, suggesting a quantitative relationship between training effort and the acquisition of effective tool-use strategies. We implement a robust framework using standard RL algorithms and frameworks, and demonstrate that it outperforms existing methods.
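As a rough illustration of the idea, the sketch below shows a minimal tool-integrated rollout loop of the kind ZeroTIR relies on: the policy generates text, any Python snippet it emits is executed, the output is appended to the context as an observation, and the finished trajectory is scored with a binary outcome reward. All names (`run_python`, `rollout`, `toy_policy`), the `<python>...</python>` and `Answer:` conventions, and the toy policy are illustrative assumptions, not the paper's actual implementation.

```python
import contextlib
import io
import re

# Assumed conventions (not from the paper): the model wraps tool calls in
# <python>...</python> tags and marks its final answer as "Answer: ...".
CODE_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)
FINAL_ANSWER = re.compile(r"Answer:\s*(.+)")


def run_python(code: str, max_chars: int = 2000) -> str:
    """Execute a generated snippet and capture stdout (a real system would sandbox this)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:  # errors are returned as observations, not raised
        return f"Error: {exc}"
    return buf.getvalue()[:max_chars]


def rollout(policy, question: str, reference: str, max_tool_calls: int = 4):
    """Interleave generation and code execution, then score with a binary outcome reward."""
    context = question
    for _ in range(max_tool_calls):
        completion = policy(context)
        context += completion
        code = CODE_BLOCK.search(completion)
        if code is None:  # no tool call -> the model has committed to a final answer
            break
        observation = run_python(code.group(1))
        context += f"\n[python output]\n{observation}\n"  # feed the result back to the model
    answer = FINAL_ANSWER.search(context)
    reward = int(answer is not None and answer.group(1).strip() == reference)
    return context, reward


# Toy stand-in for the LLM: it first writes code, then reads the output and answers.
def toy_policy(context: str) -> str:
    if "[python output]" not in context:
        return "\n<python>print(2 + 2)</python>"
    return "\nAnswer: 4"


trajectory, reward = rollout(toy_policy, "Question: what is 2 + 2?", reference="4")
print(reward)  # 1 -- this scalar is the only learning signal the RL update receives
```

In a full training run, many such trajectories would be collected per batch and the outcome reward fed to a standard policy-gradient update; no supervised tool-use demonstrations are involved, which is the point the paper makes.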

Takeaways, Limitations

Takeaways:
We demonstrate that outcome-based reward RL can effectively teach LLMs the ability to autonomously utilize external tools.
We provide a baseline for future research by quantifying the relationship between RL training steps and performance improvement.
The proposed ZeroTIR methodology outperforms existing methods in solving difficult mathematical problems.
We support follow-up research by releasing a reproducible research environment and code.
Limitations:
The method is currently limited to Python code execution; further research is needed on extending it to other types of tools.
The range of mathematical problem benchmarks used may be limited, and performance evaluations on a wider variety of problem types are needed.
The computational cost of RL training can be significant, and further research is needed to develop efficient training methods.