Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Created by
  • Haebom

Author

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Outline

This paper notes that although recent GRPO (Group Relative Policy Optimization) methods improve human preference alignment in image and video generation models, they still suffer from high computational cost, driven by on-policy rollouts and the many SDE sampling steps, and from training instability caused by sparse rewards. To address these issues, the authors propose BranchGRPO, which introduces a branching sampling policy that reshapes the SDE sampling process into a tree: computation is shared across common prefixes, and low-reward paths and redundant depths are pruned, so per-update cost drops substantially while exploration diversity is maintained or improved. The key contributions are a branching sampling scheme that reduces rollout and training cost, a tree-based advantage estimator that incorporates dense process-level rewards, and pruning strategies that exploit path and depth redundancy to improve convergence and performance. On image and video preference alignment experiments, BranchGRPO improves alignment scores by 16% over a strong baseline while reducing training time by 50%.
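The core idea, branching rollouts that share common prefixes, prune weak paths at each depth, and score the surviving paths with a group-relative (GRPO-style) advantage, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (policy_step, reward_fn), the summed path-reward scoring, and the fixed keep_ratio pruning rule are assumptions made for illustration only.

```python
import numpy as np

def branch_rollout(policy_step, x0, depth, branch_factor, reward_fn,
                   keep_ratio=0.5, seed=0):
    """Tree-structured rollout sketch: sibling branches reuse their common
    prefix, and low-reward branches are pruned at every depth.

    policy_step(state, t, rng) -> next latent (one stochastic SDE denoising step)
    reward_fn(state)           -> scalar process-level reward for an intermediate state
    """
    rng = np.random.default_rng(seed)
    frontier = [(x0, 0.0)]  # (state, accumulated dense reward along the path)
    for t in range(depth):
        children = []
        for state, r_sum in frontier:
            # All branch_factor continuations reuse the computation that
            # produced `state` (the shared prefix).
            for _ in range(branch_factor):
                nxt = policy_step(state, t, rng)
                children.append((nxt, r_sum + reward_fn(nxt)))
        # Keep only the most promising paths to bound per-update cost.
        children.sort(key=lambda node: node[1], reverse=True)
        keep = max(1, int(len(children) * keep_ratio))
        frontier = children[:keep]
    return frontier  # surviving leaves and their path rewards

def group_relative_advantages(path_rewards):
    """GRPO-style advantage: each surviving path's reward relative to the group."""
    r = np.asarray(path_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy usage: scalar states, reward prefers values near 1.
leaves = branch_rollout(
    policy_step=lambda s, t, rng: s + 0.1 * rng.standard_normal(),
    x0=0.0, depth=4, branch_factor=3,
    reward_fn=lambda s: -abs(s - 1.0),
)
adv = group_relative_advantages([r for _, r in leaves])
```

In the paper's actual method, the tree-based advantage estimator fuses dense process-level rewards along the tree depth and pruning also removes redundant depths; the sketch above only conveys the prefix-sharing and path-pruning intuition.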

Takeaways, Limitations

Takeaways:
BranchGRPO significantly improves the human preference alignment of GRPO-based image and video generation models (16% higher alignment scores than a strong baseline).
It effectively reduces computational cost, cutting training time by 50%.
The branching sampling scheme, tree-based advantage estimator, and pruning strategies are new techniques that open directions for future research.
Limitations:
The effectiveness of the proposed method may be limited to specific datasets and models; additional experiments on more diverse datasets and models are needed.
The design of the dense process-level reward may affect performance, and further research is needed on optimal reward design.
Since the hyperparameters of the pruning strategy can affect performance, research on efficient tuning methods is needed.