Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Group-in-Group Policy Optimization for LLM Agent Training

Created by
  • Haebom

Authors

Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An

Outline

This paper proposes Group-in-Group Policy Optimization (GiGPO), a novel algorithm for scaling group-based reinforcement learning (RL) to long-horizon large language model (LLM) agent training. While preserving the advantages of existing group-based RL (critic-free training, low memory footprint, and stable convergence), GiGPO achieves fine-grained, step-level credit assignment through a two-level hierarchy that computes relative advantages at both the episode and step levels. At the episode level, macro relative advantages are computed over groups of complete trajectories; at the step level, micro relative advantages are estimated via an anchor state grouping mechanism that identifies environment states recurring across trajectories and retroactively constructs step-level groups around them. Evaluations on the ALFWorld and WebShop benchmarks with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show performance gains of over 12% on ALFWorld and over 9% on WebShop compared to the GRPO baseline, while keeping the same GPU memory overhead and LLM rollout cost, with little to no additional time overhead.
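To make the two-level idea concrete, the sketch below shows one way the episode-level and step-level relative advantages described above could be computed and combined. This is an illustrative reconstruction based only on this summary, not the authors' implementation: the trajectory format, the use of the raw environment state as the anchor key, the undiscounted return-to-go, and the mixing weight `w` are all assumptions.

```python
# Minimal sketch of a two-level (episode + step) relative-advantage computation
# in the spirit of GiGPO. Hypothetical data layout: each trajectory is a list of
# steps, each step a dict with a hashable 'state' and a scalar 'reward'.
from collections import defaultdict
import numpy as np

def two_level_advantages(trajectories, w=1.0, eps=1e-8):
    # Episode level: normalize total returns within the group of trajectories.
    returns = np.array([sum(s["reward"] for s in traj) for traj in trajectories])
    ep_adv = (returns - returns.mean()) / (returns.std() + eps)

    # Step level: anchor-state grouping -- steps taken from the identical
    # environment state (possibly in different trajectories) form one group,
    # and each step is scored by its return-to-go relative to that group.
    groups = defaultdict(list)  # anchor state -> [(traj_idx, step_idx, return-to-go)]
    for i, traj in enumerate(trajectories):
        rtg, rtgs = 0.0, []
        for step in reversed(traj):      # undiscounted return-to-go (assumption)
            rtg += step["reward"]
            rtgs.append(rtg)
        rtgs.reverse()
        for t, step in enumerate(traj):
            groups[step["state"]].append((i, t, rtgs[t]))

    step_adv = [np.zeros(len(traj)) for traj in trajectories]
    for members in groups.values():
        vals = np.array([v for _, _, v in members])
        mean, std = vals.mean(), vals.std() + eps
        for i, t, v in members:
            step_adv[i][t] = (v - mean) / std

    # Combine macro (episode) and micro (step) signals per step.
    return [ep_adv[i] + w * step_adv[i] for i in range(len(trajectories))]
```

Because both levels only require grouping and normalizing quantities already produced by the rollouts, this kind of computation adds essentially no extra model evaluations, which is consistent with the summary's claim of unchanged GPU memory and rollout cost.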

Takeaways, Limitations

Takeaways:
GiGPO is a novel, efficient RL algorithm that addresses the scalability of group-based RL for long-horizon LLM agent training.
It enables fine-grained, step-level credit assignment while retaining the advantages of existing group-based RL.
Performance improvements over the GRPO baseline are experimentally verified on the ALFWorld and WebShop benchmarks.
Performance gains are achieved without additional GPU memory overhead and with little to no additional time cost.
Limitations:
The reported gains may not generalize beyond the specific LLMs and benchmarks evaluated.
A broader comparative analysis against other RL algorithms is needed.
Further research is needed on the generality of the anchor state grouping mechanism and its applicability to various environments.
Performance evaluation is needed in complex environments or over longer time horizons.