This paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to address the shortcomings of existing infrared and visible image fusion (IVIF) methods: a lack of deep semantic understanding, artifact generation, and loss of detail. SGDFuse frames fusion as conditional generation, leveraging the high-quality semantic masks produced by SAM as explicit prior information. The method operates in two stages: it first performs a preliminary fusion of the multimodal features, and then runs the diffusion model's coarse-to-fine denoising generation conditioned jointly on the SAM masks and the preliminary fused image. This conditioning keeps the fusion semantically directed while preserving the fidelity of the final result. Experiments show that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in applicability to downstream tasks. The source code is available on GitHub.
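To make the two-stage pipeline concrete, the following is a minimal sketch of the structure described above: a coarse fusion network followed by a diffusion denoiser conditioned on both the SAM mask and the preliminary fused image. All module names (`PrelimFusion`, `CondDenoiser`), channel sizes, and the simplified DDPM-style sampling loop are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a two-stage SAM-guided fusion pipeline (assumed form,
# not the paper's implementation).
import torch
import torch.nn as nn

class PrelimFusion(nn.Module):
    """Stage 1 (assumed form): coarse fusion of the IR and visible inputs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, ir, vis):
        return self.net(torch.cat([ir, vis], dim=1))

class CondDenoiser(nn.Module):
    """Stage 2 (assumed form): predicts noise from the noisy image
    concatenated with the SAM mask and the preliminary fused image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_t, sam_mask, prelim):
        return self.net(torch.cat([x_t, sam_mask, prelim], dim=1))

@torch.no_grad()
def sample(denoiser, prelim, sam_mask, steps=50):
    """Simplified DDPM-style reverse process; the semantic condition
    is injected at every denoising step."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(prelim)  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, sam_mask, prelim)       # semantic guidance here
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                 # add noise except at t=0
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

ir = torch.rand(1, 1, 64, 64)        # infrared input
vis = torch.rand(1, 1, 64, 64)       # visible input
sam_mask = torch.rand(1, 1, 64, 64)  # stand-in for a SAM semantic mask
prelim = PrelimFusion()(ir, vis)                   # stage 1: coarse fusion
fused = sample(CondDenoiser(), prelim, sam_mask)   # stage 2: refinement
```

The key design point this sketch illustrates is that the SAM mask and the preliminary fused image enter as conditions at every reverse-diffusion step, which is what lets the generation stay semantically directed rather than being guided only at initialization.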