Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Created by
  • Haebom

Authors

Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

Outline

This paper presents a method to improve the efficiency of reinforcement learning (RL) for enhancing the reasoning capability of multimodal large language models (MLLMs). Existing RL pipelines suffer from two problems: "advantage collapsing," where most advantages in a batch concentrate near zero, and "rollout silencing," where the proportion of rollouts producing non-zero gradients decreases over time. To address these issues, the authors propose Shuffle-R1, a framework that dynamically restructures trajectory sampling and batch composition to improve RL fine-tuning efficiency. Shuffle-R1 introduces "pairwise trajectory sampling," which selects trajectories with high advantage contrast to improve gradient signal quality, and "advantage-based trajectory shuffling," which increases the exposure of valuable rollouts. Experiments on various reasoning benchmarks show that Shuffle-R1 consistently outperforms strong RL baselines with minimal overhead.
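The two components can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the pairing heuristic (matching lowest-advantage with highest-advantage trajectories), and the descending-|advantage| ordering are all assumptions made for illustration:

```python
def pairwise_sample(trajectories, k):
    """Hypothetical sketch of pairwise trajectory sampling:
    pick k pairs with high advantage contrast.

    trajectories: list of (traj_id, advantage) tuples.
    Pairs the lowest-advantage trajectories with the highest ones,
    so each pair carries a strong contrastive gradient signal.
    """
    ordered = sorted(trajectories, key=lambda t: t[1])
    pairs = []
    for i in range(min(k, len(ordered) // 2)):
        pairs.append((ordered[i], ordered[-1 - i]))
    return pairs

def advantage_shuffle(batch):
    """Hypothetical sketch of advantage-based trajectory shuffling:
    reorder the batch so rollouts with large |advantage| (the ones
    that produce non-trivial gradients) come first, instead of being
    drowned out by near-zero-advantage rollouts.
    """
    return sorted(batch, key=lambda t: abs(t[1]), reverse=True)

trajs = [("a", 0.0), ("b", 1.0), ("c", -1.0), ("d", 0.1)]
print(pairwise_sample(trajs, 1))   # one high-contrast pair: ("c", -1.0) with ("b", 1.0)
print(advantage_shuffle(trajs))    # large-|advantage| rollouts ("b", "c") moved to the front
```

The real method operates inside an RL fine-tuning loop over model rollouts; this toy version only shows the data-centric idea of prioritizing high-contrast, high-|advantage| trajectories.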

Takeaways, Limitations

Takeaways:
  • Presents a novel framework (Shuffle-R1) that significantly improves the efficiency of RL training for enhancing the reasoning capability of MLLMs.
  • Effectively addresses the advantage collapsing and rollout silencing problems, enabling better-targeted gradient updates.
  • Demonstrates that a data-centric approach can improve the efficiency of RL training.
  • Shows superior performance over existing methods on various reasoning benchmarks.
Limitations:
  • Further research is needed on the generalization performance of Shuffle-R1.
  • The method may be effective only for certain types of MLLMs or reasoning tasks.
  • The paper lacks a detailed analysis of the computational cost and complexity of the proposed method.