Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Tree Search for LLM Agent Reinforcement Learning

Created by
  • Haebom

Author

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu

Outline

Advances in reinforcement learning (RL) have improved the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, however, existing approaches that rely solely on outcome rewards suffer from sparse supervision. To address this, the authors propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, in which each tree node represents one complete agent interaction step. By sharing common prefixes, tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, the tree-structured trajectories naturally yield step-level process supervision signals even when only outcome rewards are available. Building on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Theoretical analysis shows that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments on 11 datasets covering three types of QA tasks demonstrate that the proposed tree-based RL outperforms chain-based RL methods.
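To make the grouped advantage estimation concrete, below is a minimal sketch, not the authors' implementation. It assumes a simple trajectory tree in which each node is one agent step and leaves carry the outcome reward of their trajectory; the names `Node`, `subtree_return`, `intra_tree_advantages`, and `inter_tree_advantages` are illustrative, and the mean-only baseline is a simplification (the paper's formulation may, like GRPO, also normalize by the group's standard deviation).

```python
from dataclasses import dataclass, field
from statistics import mean


# Illustrative node type: each node is one complete agent interaction
# step; leaves carry the outcome reward of their full trajectory.
@dataclass
class Node:
    reward: float = 0.0                      # outcome reward (used at leaves)
    children: list["Node"] = field(default_factory=list)


def subtree_return(node: Node) -> float:
    """Mean outcome reward over all leaves below this node."""
    if not node.children:
        return node.reward
    return mean(subtree_return(c) for c in node.children)


def intra_tree_advantages(root: Node) -> dict[int, float]:
    """Relative advantage of each branch against its siblings.

    Every internal node defines one group: branches leading to
    better-than-average leaves get positive advantage, yielding a
    step-level signal from outcome rewards alone.
    """
    adv: dict[int, float] = {}               # keyed by id(node) for this sketch
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            returns = [subtree_return(c) for c in node.children]
            baseline = mean(returns)
            for child, ret in zip(node.children, returns):
                adv[id(child)] = ret - baseline
            stack.extend(node.children)
    return adv


def inter_tree_advantages(roots: list[Node]) -> dict[int, float]:
    """Relative advantage of each whole tree against the other trees
    sampled for the same prompt."""
    returns = [subtree_return(r) for r in roots]
    baseline = mean(returns)
    return {id(r): ret - baseline for r, ret in zip(roots, returns)}
```

Intuitively, the intra-tree term supplies step-level preferences between sibling branches that share a prefix, while the inter-tree term retains the trajectory-level comparison familiar from chain-based GRPO.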

Takeaways, Limitations

Takeaways:
  • Tree-GRPO introduces a novel reinforcement learning method that leverages tree search to address the sparse-supervision problem of LLM agents.
  • The tree structure yields step-level supervision signals from outcome rewards alone, improving learning efficiency.
  • Grouped relative advantage estimation at the intra- and inter-tree levels improves policy optimization.
  • The method's advantage over chain-based RL is demonstrated through experiments on diverse datasets and tasks.
Limitations:
  • Specific limitations are not stated in the source (only a summary of the paper is provided).