Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

Created by
  • Haebom

Author

Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang

Outline

In this paper, we present EFRame, a framework that improves the Group Relative Policy Optimization (GRPO) algorithm, which suffers from limited exploration, low sample efficiency, and instability on complex reasoning tasks. EFRame systematically augments GRPO with three components: additional rollouts to explore high-quality trajectories, online filtering to remove low-quality samples that introduce noise and variance, and experience replay to repeatedly exploit rare but informative samples. Experiments on various reasoning benchmarks show that EFRame not only improves the robustness and efficiency of training but also enables deeper reasoning capabilities that vanilla GRPO cannot reach. Furthermore, EFRame supports a finer-grained classification of training samples, enabling a deeper analysis of how different types of samples contribute to the reinforcement learning process.
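The three stages described above can be sketched as a toy training-loop step. All function names, thresholds, and the stand-in reward values below are illustrative assumptions for the exploration-filter-replay idea, not the paper's actual implementation:

```python
import random
from collections import deque

random.seed(0)

def generate_rollouts(n):
    # Stand-in for policy rollouts: each trajectory gets a scalar reward in [0, 1].
    return [{"traj": i, "reward": random.random()} for i in range(n)]

def eframe_step(base_rollouts=4, extra_rollouts=4,
                keep_threshold=0.3, rare_threshold=0.9,
                replay_buffer=None):
    # 1) Exploration: sample extra rollouts beyond GRPO's usual group size
    #    to increase the chance of finding high-quality trajectories.
    group = generate_rollouts(base_rollouts + extra_rollouts)
    # 2) Online filtering: discard low-reward samples that add noise and variance.
    kept = [s for s in group if s["reward"] >= keep_threshold]
    # 3) Experience replay: store rare, highly informative samples and
    #    re-inject them into later update batches.
    if replay_buffer is not None:
        for s in kept:
            if s["reward"] >= rare_threshold:
                replay_buffer.append(s)
        kept = kept + list(replay_buffer)  # replayed samples join the batch
    return kept

buffer = deque(maxlen=64)  # bounded replay buffer (size is an assumption)
batch = eframe_step(replay_buffer=buffer)
```

The returned `batch` would then feed a GRPO-style policy update; the key point is that filtering prunes the group before the update, while the replay buffer keeps rare high-reward trajectories available across steps.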

Takeaways, Limitations

Takeaways:
We present EFRame, a framework that effectively addresses GRPO's limited exploration, low sample efficiency, and instability.
EFRame achieves more robust and efficient reinforcement learning training and deeper reasoning capabilities.
Fine-grained classification of training samples enables in-depth analysis of the reinforcement learning process.
Publicly released code on GitHub improves reproducibility and usability.
Limitations:
The types and scope of the benchmarks presented in the paper may be limited; additional experiments on diverse reasoning tasks may be needed.
EFRame's performance gains may be biased toward certain types of reasoning tasks or datasets.
A detailed analysis of EFRame's computational cost and memory usage may be lacking.