Daily Arxiv

This page collects papers on artificial intelligence published around the world.
It is summarized using Google Gemini and operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; please cite the source when sharing.

Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Created by
  • Haebom

Author

Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun

SPEAR: Curriculum-based Self-Imitation Learning for Agentic LLMs

Outline

Reinforcement learning (RL) is the dominant paradigm for improving strategic tool use in LLM agents on long-horizon, sparse-reward tasks, but it faces a fundamental exploration-exploitation trade-off. Existing work stimulates exploration mainly through policy entropy, which risks entropy collapse or divergence; this paper instead aims for a gradual exploration-exploitation balance grounded in the agent's own experience. The proposed method, SPEAR, trains LLM agents with a curriculum-based self-imitation learning (SIL) approach. Extending the vanilla SIL framework, SPEAR stores self-generated promising trajectories in a replay buffer for off-policy updates and evolves the policy incrementally while keeping entropy within a well-balanced range at each stage. The curriculum steers both skill-level exploration via intrinsic rewards and action-level exploration via SIL. Early in training, auxiliary tool-invocation rewards play a crucial role in accumulating tool-use skills; as training progresses, self-imitation is strengthened to exploit previously successful patterns, and regularization that suppresses overconfidence is introduced to keep trajectory-level entropy under control.
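The two mechanisms described above, a replay buffer that retains only promising self-generated trajectories and a curriculum that shifts weight from auxiliary tool-invocation rewards to self-imitation, can be illustrated with a minimal sketch. This is not the authors' implementation: the names (Trajectory, ReplayBuffer, curriculum_weights) and the linear schedule are illustrative assumptions.

```python
# Illustrative sketch only; names and the linear schedule are assumptions,
# not taken from the SPEAR paper.
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    actions: list
    total_return: float


class ReplayBuffer:
    """Keeps only self-generated trajectories whose return clears a threshold."""

    def __init__(self, capacity: int = 512, min_return: float = 0.0):
        self.capacity = capacity
        self.min_return = min_return
        self.items: list[Trajectory] = []

    def add(self, traj: Trajectory) -> None:
        # Store only promising (high-return) rollouts for later off-policy reuse.
        if traj.total_return > self.min_return:
            self.items.append(traj)
            # Drop the weakest trajectories once capacity is exceeded.
            self.items.sort(key=lambda t: t.total_return)
            self.items = self.items[-self.capacity:]

    def sample(self, k: int) -> list[Trajectory]:
        return random.sample(self.items, min(k, len(self.items)))


def curriculum_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Anneal the auxiliary tool-call reward while ramping up self-imitation."""
    progress = min(1.0, step / max(total_steps, 1))
    tool_reward_weight = 1.0 - progress  # large early: encourages tool-use skills
    sil_weight = progress                # large late: exploits stored successes
    return tool_reward_weight, sil_weight
```

In the actual training loop, the self-imitation term would be an additional off-policy loss computed on samples drawn from this buffer and combined with the standard on-policy RL objective, with the overconfidence-suppressing regularizer applied on top.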

Takeaways, Limitations

Takeaways:
A novel methodology for resolving the exploration-exploitation dilemma in RL-based LLM agent training.
Balances exploration and exploitation through a curriculum-based self-imitation learning (SIL) approach.
Presents a staged curriculum for acquiring tool-use skills.
Introduces regularization techniques for training stability.
Limitations:
Lack of specific experimental results and performance comparisons (not available in the paper abstract).
No comparative analysis against other RL-based methodologies.
Limited information on performance in the targeted setting (long-horizon, sparse-reward environments).