This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
Created by
Haebom
Author
Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai
Outline
This paper presents a method to improve the efficiency of reinforcement learning (RL) for enhancing the reasoning capability of multimodal large language models (MLLMs). Existing RL pipelines suffer from two problems: "advantage collapsing," where most advantages concentrate near zero, and "rollout silencing," where the proportion of rollouts producing non-zero gradients decreases over time. To address these issues, the authors propose Shuffle-R1, a framework that dynamically restructures trajectory sampling and batch composition to improve RL fine-tuning efficiency. Shuffle-R1 introduces "pairwise trajectory sampling," which improves gradient signal quality by selecting high-contrast trajectories, and "advantage-based trajectory shuffling," which increases the exposure of valuable rollouts. Experimental results on various reasoning benchmarks show that Shuffle-R1 outperforms strong RL baselines with minimal overhead.
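The two techniques can be illustrated with a minimal sketch. The helper names, the pairing rule (matching the highest- and lowest-advantage trajectories in a group), and the reordering criterion (sorting by absolute advantage) are assumptions for illustration, not the paper's exact implementation; only the general ideas of high-contrast pair selection and advantage-based reordering come from the summary above.

```python
import numpy as np

def pairwise_trajectory_sampling(rewards, num_pairs):
    """Select high-contrast trajectory pairs from one rollout group.

    `rewards` holds the scalar reward of each trajectory sampled for a
    single prompt. Advantages are rewards centered on the group mean
    (GRPO-style). Each pair matches a high-advantage trajectory with a
    low-advantage one, maximizing the contrast of the gradient signal.
    """
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()
    order = np.argsort(advantages)  # ascending by advantage
    pairs = []
    for k in range(num_pairs):
        # pair the k-th most positive with the k-th most negative
        pairs.append((order[-(k + 1)], order[k]))
    return pairs, advantages

def advantage_based_shuffle(indices, advantages):
    """Reorder a batch so rollouts with large |advantage| come first,
    exposing informative (non-zero-gradient) rollouts to the update."""
    return sorted(indices, key=lambda i: -abs(advantages[i]))
```

For example, with rewards `[1.0, 0.0, 0.5, 0.0]` the group mean is 0.375, so trajectory 0 has the largest advantage and is paired against one of the zero-reward trajectories, while the shuffle moves near-zero-advantage rollouts (which contribute little gradient) to the back of the batch.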
Takeaways, Limitations
•
Takeaways:
◦
We present a novel framework (Shuffle-R1) that significantly improves the efficiency of RL training to enhance the inference capability of MLLM.
◦
It effectively addresses the advantage collapsing and rollout silencing problems, enabling optimized gradient updates.
◦
We demonstrate that a data-driven approach can improve the efficiency of RL training.
◦
It demonstrates superior performance compared to existing methods on various reasoning benchmarks.
•
Limitations:
◦
Further research is needed on the generalization performance of Shuffle-R1.
◦
It may only be effective for certain types of MLLM or inference tasks.
◦
Lack of detailed analysis of the computational cost and complexity of the proposed method.