This paper applies preference learning, a family of techniques for aligning generative models with human expectations, to diffusion models. Existing methods such as Diffusion-DPO face two major challenges: time-step-dependent instability and off-policy bias arising from the mismatch between the optimization policy and the data-collection policy. To address these issues, we propose DPO-CM, which improves stability and partially mitigates off-policy bias by clipping and masking irrelevant time steps, and Importance-Sampled Direct Preference Optimization (SDPO), which fully compensates for off-policy bias via importance sampling and emphasizes informative updates during the diffusion process. Experiments on CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B show that both methods outperform Diffusion-DPO, with SDPO achieving the best results in VBench scores, human preference alignment, and training robustness.
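To make the two ingredients concrete, the sketch below shows a simplified Diffusion-DPO-style preference loss with (a) clipping and time-step masking in the spirit of DPO-CM and (b) per-sample importance weights in the spirit of SDPO. This is an illustrative assumption of how such terms could be combined, not the paper's exact objective or API; all function and argument names (e.g., `preference_loss`, `importance_weight`, `t_min`, `t_max`) are hypothetical.

```python
# Hedged sketch only: a simplified Diffusion-DPO-style loss with
# timestep clipping/masking (DPO-CM-like) and importance weights (SDPO-like).
# Names and exact forms are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def preference_loss(
    err_w_policy: torch.Tensor,   # ||eps - eps_theta(x_t^w, t)||^2, preferred sample
    err_w_ref: torch.Tensor,      # same term under the frozen reference model
    err_l_policy: torch.Tensor,   # dispreferred sample, current policy
    err_l_ref: torch.Tensor,      # dispreferred sample, reference model
    t: torch.Tensor,              # sampled diffusion timesteps, shape (batch,)
    importance_weight: torch.Tensor | None = None,  # hypothetical off-policy correction
    beta: float = 500.0,          # preference temperature (typical Diffusion-DPO scale)
    t_min: int = 0,
    t_max: int = 1000,
) -> torch.Tensor:
    """Simplified preference objective for a diffusion policy (illustrative)."""
    # Standard Diffusion-DPO "advantage": how much more the policy improves on
    # the preferred sample than on the dispreferred one, relative to the reference.
    advantage = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)

    # DPO-CM-style stabilization (assumed form): mask timesteps outside an
    # informative window and clip the advantage to bound per-step gradients.
    mask = ((t >= t_min) & (t < t_max)).float()
    advantage = advantage.clamp(-1.0, 1.0)

    # Logistic preference loss, as in DPO: push the advantage negative.
    per_sample = -F.logsigmoid(-beta * advantage)

    # SDPO-style importance weighting (assumed form): reweight each sample to
    # correct for the mismatch between data-collection and optimization policies.
    if importance_weight is not None:
        per_sample = per_sample * importance_weight

    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
```

In this sketch the mask and clip correspond to the partial mitigation attributed to DPO-CM, while the multiplicative importance weight plays the role of the full off-policy correction attributed to SDPO.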