Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

SDPO: Importance-Sampled Direct Preference Optimization for Stable Diffusion Training

Created by
  • Haebom

Author

Xiaomeng Yang, Zhiyu Tan, Junyan Wang, Zhijian Zhou, Hao Li

Outline

This paper applies preference learning, commonly used to align generative models with human expectations, to diffusion models. Existing methods such as Diffusion-DPO face two major challenges: time-step-dependent instability and off-policy bias caused by the mismatch between the optimization policy and the data-collection policy. To address these issues, the authors propose DPO-CM, which improves stability and partially mitigates off-policy bias by clipping and masking uninformative time steps, and Importance-Sampled Direct Preference Optimization (SDPO), which fully compensates for off-policy bias via importance sampling and emphasizes informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B show that both methods outperform Diffusion-DPO, with SDPO achieving the best results in VBench scores, human preference alignment, and training robustness.
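To make the role of the importance weights more concrete, the sketch below shows a minimal, hypothetical PyTorch implementation of an importance-weighted Diffusion-DPO-style objective. This is not the authors' code: the function name, the use of per-sample denoising errors as a surrogate for the implicit log-likelihood ratio, and the externally supplied importance weight `iw` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def importance_sampled_dpo_loss(err_w, err_l, ref_err_w, ref_err_l,
                                iw, beta=0.1, iw_clip=10.0):
    """DPO-style preference loss on per-sample denoising errors.

    err_w, err_l         : denoising MSE of the policy on the preferred /
                           rejected sample at the sampled timestep
    ref_err_w, ref_err_l : the same quantities under the frozen reference model
    iw                   : per-sample importance weight correcting the mismatch
                           between the optimization and data-collection policies
                           (computed elsewhere; a placeholder here)
    """
    # Implicit log-likelihood ratio, as in Diffusion-DPO: a lower denoising
    # error relative to the reference means a higher implicit likelihood.
    logits = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))

    # Clip importance weights so a few extreme ratios at uninformative
    # timesteps cannot destabilize training (assumed stabilization choice).
    iw = iw.clamp(max=iw_clip).detach()

    return -(iw * F.logsigmoid(logits)).mean()

# Toy usage with random stand-in errors and uniform importance weights.
b = 4
err_w, err_l = torch.rand(b), torch.rand(b)
ref_err_w, ref_err_l = torch.rand(b), torch.rand(b)
iw = torch.ones(b)
print(importance_sampled_dpo_loss(err_w, err_l, ref_err_w, ref_err_l, iw))
```

Setting `iw` to ones recovers a plain Diffusion-DPO-style update; the paper's contribution lies in how the weights are derived and applied across timesteps, which this sketch deliberately leaves abstract.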

Takeaways, Limitations

Takeaways:
Identifying the causes of instability in diffusion model-based preference learning through time-step analysis.
Mitigating time-step instability and partially addressing off-policy bias through DPO-CM.
Achieving full off-policy bias compensation and superior performance by leveraging importance sampling in SDPO.
Highlighting the importance of time-step awareness and distribution-corrected optimization in diffusion-based preference learning.
Limitations:
The paper does not explicitly discuss its limitations.