Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

Created by
  • Haebom

Authors

Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

Outline

Advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs). However, Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), suffers from limited exploration and training instability, which restrict its effectiveness on complex reasoning tasks. To address these issues, the authors propose EFRame, a framework that augments GRPO with exploration, filtering, and replay. EFRame performs additional rollouts to enable deeper and more targeted exploration, removes low-quality samples to stabilize gradients and accelerate training, and amplifies rare but informative trajectories through experience replay to achieve stable convergence. On the Geometry3K geometry problem-solving benchmark, EFRame achieves a 37.9% relative performance improvement over GRPO. EFRame also supports fine-grained sample classification and precise entropy control, highlighting its potential as a powerful solution for advancing deeper reasoning in LLMs.
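
The abstract does not include implementation details, but the exploration-filter-replay loop can be pictured with a minimal sketch. Everything below is an illustrative assumption (the function names sample_rollouts and grpo_update, the reward_floor threshold, the buffer size), not the authors' code:

```python
# Minimal sketch of an exploration-filter-replay training step on top of a
# GRPO-style update. All names and thresholds here are assumptions for
# illustration only.
import random
from collections import deque

replay_buffer = deque(maxlen=512)  # holds rare, high-reward trajectories for reuse

def sample_rollouts(prompt, n):
    """Placeholder: sample n trajectories for a prompt, returning (trajectory, reward) pairs."""
    return [(f"{prompt}-rollout-{i}", random.random()) for i in range(n)]

def grpo_update(batch):
    """Placeholder for a GRPO-style policy update on the filtered batch."""
    print(f"updating policy on {len(batch)} samples")

def efr_step(prompts, group_size=8, extra_rollouts=8, reward_floor=0.1):
    batch = []
    for prompt in prompts:
        rollouts = sample_rollouts(prompt, group_size)
        # Exploration: if the group shows no useful reward signal, spend
        # additional rollouts to search this prompt more deeply.
        if max(r for _, r in rollouts) < reward_floor:
            rollouts += sample_rollouts(prompt, extra_rollouts)
        # Filtering: drop low-quality samples so they do not destabilize gradients.
        kept = [(t, r) for t, r in rollouts if r >= reward_floor]
        batch.extend(kept)
        # Replay: store rare but informative (high-reward) trajectories.
        replay_buffer.extend((t, r) for t, r in kept if r > 0.9)
    # Mix replayed trajectories back into the batch to amplify rare successes.
    batch.extend(random.sample(list(replay_buffer), min(len(replay_buffer), 16)))
    grpo_update(batch)

efr_step(["geometry problem 1", "geometry problem 2"])
```
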

Takeaways, Limitations

Takeaways:
EFRame improves the reasoning capability of LLMs by addressing GRPO's exploration, efficiency, and stability issues.
It achieved a 37.9% relative performance improvement over GRPO on the Geometry3K benchmark.
It supports fine-grained sample classification and precise entropy control.
Limitations:
The abstract does not state any specific limitations of the proposed method.