Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Created by
  • Haebom

Authors

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

Outline

This paper builds on recent advances in Group Relative Policy Optimization (GRPO), which improves human preference alignment in image and video generation models. Existing GRPO methods suffer from high computational costs due to on-policy rollouts and the many Stochastic Differential Equation (SDE) sampling steps they require, as well as training instability caused by sparse rewards. To address these issues, the paper proposes BranchGRPO, a novel method that introduces a branch sampling policy restructuring the SDE sampling process into a tree. By sharing computation across common prefixes and pruning low-reward paths and redundant depths, BranchGRPO significantly reduces per-update computational cost while maintaining or improving exploration diversity. Key contributions include reduced rollout and training costs through branch sampling, a tree-based advantage estimator that incorporates dense process-level rewards, and improved convergence and performance through pruning strategies that exploit path and depth redundancy. Experimental results show that BranchGRPO improves alignment scores by 16% and reduces training time by 50% compared to a strong baseline.
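To make the branching idea more concrete, below is a minimal, self-contained Python sketch of one way such a rollout tree could be organized: branching only at selected depths so prefixes are shared, back-filling dense process-level rewards from leaves to inner nodes, computing group-relative advantages among siblings, and pruning low-reward paths. This is an illustrative sketch under my own assumptions (the node structure, branching schedule, mean-based reward back-fill, and top-k pruning rule are all hypothetical), not the authors' implementation.

```python
# Illustrative sketch only -- NOT the BranchGRPO reference implementation.
# All design choices here (node layout, branching schedule, reward back-fill,
# sibling-normalized advantages, top-k pruning) are assumptions for clarity.

from dataclasses import dataclass, field
from typing import List, Optional
import random


@dataclass
class Node:
    """One state in a branching SDE rollout tree."""
    depth: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None      # set at leaves, back-filled for inner nodes
    advantage: Optional[float] = None   # group-relative advantage among siblings


def rollout_tree(max_depth: int, branch_depths: set, branch_factor: int) -> Node:
    """Expand a rollout tree: branch only at selected depths, so the common
    prefix up to each branch point is computed once and shared."""
    root = Node(depth=0)
    frontier = [root]
    for d in range(1, max_depth + 1):
        next_frontier = []
        for node in frontier:
            k = branch_factor if d in branch_depths else 1
            for _ in range(k):
                child = Node(depth=d, parent=node)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def backfill_rewards(node: Node) -> float:
    """Dense process-level reward (one possible choice): a leaf keeps its own
    reward, an inner node takes the mean of its children's rewards."""
    if not node.children:
        node.reward = random.random()   # stand-in for a real reward model score
        return node.reward
    node.reward = sum(backfill_rewards(c) for c in node.children) / len(node.children)
    return node.reward


def assign_advantages(node: Node) -> None:
    """Group-relative advantage: normalize each child's reward against its
    sibling group, mirroring a GRPO-style baseline."""
    if node.children:
        rewards = [c.reward for c in node.children]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        for c in node.children:
            c.advantage = (c.reward - mean) / std
            assign_advantages(c)


def prune_low_reward(node: Node, keep: int) -> None:
    """Width pruning: keep only the top-`keep` children by reward at each
    branch point, discarding low-reward paths before the policy update."""
    if len(node.children) > keep:
        node.children = sorted(node.children, key=lambda c: c.reward, reverse=True)[:keep]
    for c in node.children:
        prune_low_reward(c, keep)


root = rollout_tree(max_depth=6, branch_depths={2, 4}, branch_factor=3)
backfill_rewards(root)
assign_advantages(root)
prune_low_reward(root, keep=2)
```

In this toy version, the cost saving comes from sharing every non-branching step across all descendants, while pruning bounds the number of surviving paths per update; how rewards are back-filled and which depths branch are exactly the kinds of choices the paper's pruning and advantage-estimation strategies address.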

Takeaways, Limitations

Takeaways:
We present a novel method (BranchGRPO) that effectively addresses the computational cost and training instability issues of GRPO.
BranchGRPO improves the human preference alignment of image and video generation models while reducing training time by 50% and improving alignment scores by 16%.
We present novel techniques such as branch sampling, a tree-based advantage estimator, and pruning strategies.
Limitations:
Further research is needed on the generalization performance of the proposed method.
Additional experiments on various datasets and models are needed.
The complexity of the dense reward design and its optimization process may not be described in sufficient detail.