Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model

Created by
  • Haebom

Authors

Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao

Outline

In this paper, we propose ReMix, an off-policy reinforcement learning technique that addresses the computational inefficiency of existing on-policy approaches, a key limitation of using reinforcement learning (RL) to improve the reasoning ability of large language models (LLMs). ReMix extends on-policy reinforcement finetuning (RFT) methods such as PPO and GRPO so that they can leverage off-policy data, and consists of three main components: Mix-policy proximal policy gradient, KL-Convex policy constraint, and policy reincarnation. Experimental results show that ReMix achieves state-of-the-art performance on various mathematical reasoning benchmarks while reducing training cost, measured by rollout data volume, by 30x to 450x compared to existing methods. In addition, we present insightful analyses, such as the preference for shorter responses caused by the whipping effect of off-policy discrepancy and the collapse of self-reflective behavior under severely off-policy conditions.
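The following is a minimal, hypothetical sketch (not the authors' code) of how the first two components might fit together in PyTorch: a PPO-style clipped surrogate computed against the behavior policy that generated the rollouts, so that stale off-policy samples can be reused, plus a convex mixture of KL penalties toward a frozen reference policy and the previous policy as a stand-in for the KL-Convex constraint. Function and parameter names such as mix_policy_ppo_loss, logp_behavior, and kl_convex_lambda are illustrative assumptions; policy reincarnation (restarting from an earlier trained policy) is not shown.

```python
# Minimal sketch, assuming token-level log-probabilities are precomputed.
# This is NOT the ReMix implementation; names and defaults are illustrative.
import torch


def mix_policy_ppo_loss(
    logp_new: torch.Tensor,        # log-probs under the current policy
    logp_behavior: torch.Tensor,   # log-probs under the (possibly stale) behavior policy
    logp_ref: torch.Tensor,        # log-probs under a frozen reference policy
    logp_old: torch.Tensor,        # log-probs under the previous policy iterate
    advantages: torch.Tensor,      # advantage estimates (e.g., group-normalized rewards)
    clip_eps: float = 0.2,
    kl_coef: float = 0.05,
    kl_convex_lambda: float = 0.5,  # convex weight between KL-to-reference and KL-to-old
) -> torch.Tensor:
    # Importance ratio w.r.t. the behavior policy allows reuse of off-policy rollouts.
    ratio = torch.exp(logp_new - logp_behavior)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # Convex combination of per-token log-ratio estimates of the two KL terms.
    kl_ref = logp_new - logp_ref
    kl_old = logp_new - logp_old
    kl_penalty = kl_convex_lambda * kl_ref + (1.0 - kl_convex_lambda) * kl_old

    # Negate because optimizers minimize; maximize surrogate minus KL penalty.
    return -(surrogate - kl_coef * kl_penalty).mean()


if __name__ == "__main__":
    # Toy usage with random token-level statistics for a batch of 4 sequences x 16 tokens.
    shape = (4, 16)
    logp = torch.randn(shape)
    loss = mix_policy_ppo_loss(
        logp_new=logp + 0.01 * torch.randn(shape),
        logp_behavior=logp,
        logp_ref=logp.detach(),
        logp_old=logp.detach(),
        advantages=torch.randn(shape),
    )
    print(float(loss))
```

In this sketch, setting kl_convex_lambda to 1.0 recovers a pure KL-to-reference penalty, while 0.0 recovers a trust-region-style penalty toward the previous policy; intermediate values interpolate between the two.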

Takeaways, Limitations

Takeaways:
We present ReMix, a novel off-policy technique that effectively addresses the high computational cost of existing on-policy reinforcement learning methods for enhancing LLM reasoning ability.
ReMix achieves state-of-the-art (SOTA) performance on a variety of mathematical reasoning benchmarks while dramatically reducing training cost (30x–450x less rollout data).
We present the key techniques (Mix-policy proximal policy gradient, KL-Convex policy constraint, policy reincarnation) that improve the efficiency and stability of off-policy reinforcement learning.
The paper provides in-depth analysis of and insight into phenomena that arise during off-policy training (e.g., the whipping effect and the breakdown of self-reflective behavior).
Limitations:
ReMix's performance gains may be limited to specific mathematical reasoning benchmarks; generalization to other types of tasks remains to be verified.
Further research is needed on the bias and stability issues associated with using off-policy data.
The generality and applicability of the presented analyses require further validation.