This paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to address the shortcomings of existing infrared-visible image fusion (IVIF) methods: a lack of deep semantic understanding, and artifacts and loss of detail during fusion. SGDFuse uses the high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. The framework operates in two stages: it first performs a preliminary fusion of the multimodal features, and then uses the SAM-derived semantic masks together with the preliminary fused image as conditions to drive the coarse-to-fine denoising generation of the diffusion model. This design gives the fusion process both explicit semantic direction and high-fidelity output. Experimental results demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in applicability to downstream tasks. The source code is available on GitHub.
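As a rough illustration of this two-stage design, the sketch below shows a preliminary fusion network followed by a denoiser conditioned on the coarse fused image and a SAM mask, trained with a standard DDPM-style noise-prediction loss. This is not the authors' implementation: the module names, layer sizes, toy noise schedule, and random stand-in tensors are all assumptions for illustration only.

```python
# Minimal PyTorch sketch of a SAM-conditioned two-stage fusion pipeline.
# Hypothetical modules; not the SGDFuse codebase.
import torch
import torch.nn as nn

class PreliminaryFusion(nn.Module):
    """Stage 1 (assumed): merge IR and visible inputs into a coarse fused image."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, ir, vis):
        return self.net(torch.cat([ir, vis], dim=1))

class ConditionalDenoiser(nn.Module):
    """Stage 2 (assumed): predict the noise in x_t, conditioned on the
    coarse fused image and the SAM semantic mask (timestep omitted for brevity)."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, x_t, coarse, sam_mask):
        return self.net(torch.cat([x_t, coarse, sam_mask], dim=1))

# One DDPM-style training step on random stand-in tensors.
ir = torch.rand(1, 1, 64, 64)
vis = torch.rand(1, 1, 64, 64)
sam_mask = (torch.rand(1, 1, 64, 64) > 0.5).float()  # placeholder for a SAM mask

stage1, stage2 = PreliminaryFusion(), ConditionalDenoiser()

coarse = stage1(ir, vis)                       # stage 1: coarse fused image
t = torch.randint(0, 1000, (1,))               # random diffusion timestep
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2) ** 2  # toy cosine schedule
noise = torch.randn_like(coarse)
# Forward diffusion: noise the coarse fusion according to the schedule.
x_t = alpha_bar.sqrt() * coarse + (1 - alpha_bar).sqrt() * noise
# Noise-prediction objective, with the SAM mask steering the denoiser.
loss = nn.functional.mse_loss(stage2(x_t, coarse, sam_mask), noise)
loss.backward()
```

In this toy setup the SAM mask enters only as an extra conditioning channel; the key point it illustrates is that the denoising trajectory is anchored jointly on the preliminary fusion and the semantic prior, which is the mechanism the abstract credits for semantic direction and fidelity.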