Existing click-through rate (CTR) prediction methods rely primarily on the ID modality and therefore fail to model diverse user preferences comprehensively. To address this limitation, we propose a novel framework for multimodal CTR prediction, the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN). Diff-MSIN consists of three modules: the Multi-modal Feature Enhancement (MFE) Module, the Synergistic Relationship Capture (SRC) Module, and the Feature Dynamic Adaptive Fusion (FDAF) Module. Together, these modules extract the synergies, commonalities, and distinctive characteristics of the different modalities, capture user preferences, and reduce the noise introduced during fusion. Experiments on the Rec-Tmall dataset and three Amazon datasets show that Diff-MSIN outperforms existing methods by at least 1.67%.
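
The text names a Feature Dynamic Adaptive Fusion (FDAF) Module but does not describe its mechanics. As a rough illustration of what adaptive fusion of modality features can look like in general, the sketch below weights each modality's vector by a softmax-normalized gate score and sums the results. This is an assumed, generic gating scheme, not the Diff-MSIN implementation: the function names `softmax` and `adaptive_fusion`, the hand-supplied `gate_scores`, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(
    features: Dict[str, List[float]],
    gate_scores: Dict[str, float],
) -> List[float]:
    """Fuse per-modality vectors with softmax-normalized gate weights.

    Hypothetical helper: `gate_scores` stands in for the relevance signal
    a learned gate would produce; here it is supplied by hand.
    """
    names = sorted(features)
    dim = len(features[names[0]])
    if any(len(features[n]) != dim for n in names):
        raise ValueError("all modality vectors must share one dimension")
    weights = softmax([gate_scores[n] for n in names])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused


# Toy inputs (made up for illustration, not taken from the paper).
features = {
    "id": [1.0, 0.0],
    "image": [0.0, 1.0],
    "text": [1.0, 1.0],
}
gate_scores = {"id": 0.0, "image": 0.0, "text": 0.0}
fused = adaptive_fusion(features, gate_scores)
```

With equal gate scores every modality contributes equally; raising one modality's score shifts the fused vector toward that modality, which is the sense in which a gate of this kind can down-weight a noisy input.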