Daily Arxiv

This page organizes artificial intelligence papers published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models

Created by
  • Haebom

Author

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang

BranchGRPO: Efficient Human Preference Alignment for Image and Video Generation

Outline

Despite advances in human preference alignment for image and video generation using Group Relative Policy Optimization (GRPO), existing approaches suffer from inefficiencies caused by sequential rollouts, excessive sampling steps, and sparse terminal rewards. This paper proposes BranchGRPO, which restructures the rollout process into a branching tree, amortizing computation across shared prefixes and removing low-value paths and redundant depths. BranchGRPO introduces a branching scheme that spreads rollout cost over shared prefixes, a reward-fusion mechanism with a depth-wise advantage estimator that converts sparse terminal rewards into dense step-level signals, and a pruning strategy that reduces gradient computation. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO while cutting per-iteration training time by roughly 55%. A hybrid variant, BranchGRPO-Mix, trains 4.7x faster than DanceGRPO without sacrificing alignment quality. On WanX video generation, BranchGRPO achieves higher Video-Align scores and produces sharper, more temporally consistent frames than DanceGRPO.
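To make these mechanisms concrete, the sketch below walks through a tree-structured rollout with shared prefixes, fusion of sparse leaf rewards into dense per-node signals, depth-wise advantage normalization, and pruning of low-advantage branches. It is a minimal toy illustration of the ideas summarized above, not the paper's implementation: all names (Node, branch_rollout, fuse_rewards, depthwise_advantages, prune) and the stand-in sampler and reward functions are assumptions introduced here for clarity; the real method operates on diffusion denoising trajectories scored by learned preference reward models.

```python
# Toy sketch of a BranchGRPO-style pipeline. Everything here is an
# illustrative assumption, not the paper's actual code.
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: float                  # stand-in for a partially denoised latent
    depth: int
    children: list = field(default_factory=list)
    reward: float = 0.0           # fused reward propagated up from the leaves

def branch_rollout(state, depth, max_depth, branch_factor):
    """Expand a rollout tree: siblings reuse the prefix computed so far,
    so the cost of early sampling steps is amortized across branches."""
    node = Node(state=state, depth=depth)
    if depth == max_depth:
        node.reward = toy_terminal_reward(state)   # sparse reward at leaves only
        return node
    for _ in range(branch_factor):
        child_state = toy_denoise_step(state)      # one more sampling step
        node.children.append(
            branch_rollout(child_state, depth + 1, max_depth, branch_factor))
    return node

def fuse_rewards(node):
    """Back up terminal rewards through the tree so every intermediate node
    carries a dense, step-level reward (here: mean over its children)."""
    if not node.children:
        return node.reward
    node.reward = sum(fuse_rewards(c) for c in node.children) / len(node.children)
    return node.reward

def depthwise_advantages(root):
    """Normalize fused rewards within each depth, a depth-wise analogue of
    GRPO's group-relative baseline."""
    by_depth, stack = {}, [root]
    while stack:
        n = stack.pop()
        by_depth.setdefault(n.depth, []).append(n)
        stack.extend(n.children)
    advantages = {}
    for depth, nodes in by_depth.items():
        rewards = [n.reward for n in nodes]
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        advantages[depth] = [(n, (n.reward - mean) / std) for n in nodes]
    return advantages

def prune(advantages, keep_ratio=0.5):
    """Drop the lowest-advantage nodes at each depth before any gradient
    step, reducing backprop cost on low-value paths."""
    kept = {}
    for depth, pairs in advantages.items():
        ranked = sorted(pairs, key=lambda p: p[1], reverse=True)
        kept[depth] = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return kept

# Toy stand-ins for the diffusion sampler and the preference reward model.
def toy_denoise_step(state):
    return state + random.gauss(0.0, 1.0)

def toy_terminal_reward(state):
    return -abs(state)   # pretend "closer to 0" means higher human preference

if __name__ == "__main__":
    root = branch_rollout(state=0.0, depth=0, max_depth=3, branch_factor=2)
    fuse_rewards(root)
    for depth, pairs in sorted(prune(depthwise_advantages(root)).items()):
        print(depth, [round(a, 2) for _, a in pairs])
```

The tree structure is what drives both reported gains: shared prefixes cut sampling cost per rollout, while back-propagated (fused) rewards give every depth a usable training signal instead of only the terminal step.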

Takeaways, Limitations

Takeaways:
Improves human preference alignment of image and video generation models.
Improves efficiency by reducing training time.
Optimizes the rollout process with a branching tree structure.
Provides denser, more accurate reward signals through reward fusion and depth-wise advantage estimation.
Limitations:
The paper does not explicitly state specific limitations (e.g., weaker performance on certain datasets or implementation complexity); such issues may emerge in future work.