This paper proposes CoDiff, a novel framework for improving collaborative 3D object detection in multi-agent systems. Existing collaborative 3D object detection methods produce feature representations corrupted by spatial and temporal noise arising from pose-estimation errors and communication delays, which degrades detection performance. CoDiff addresses these issues with a diffusion model: it projects high-dimensional feature maps into the latent space of a pre-trained autoencoder and guides the diffusion sampling process with information from each agent, thereby removing noise and refining the fused features. Experiments on simulated and real-world datasets demonstrate that CoDiff outperforms existing collaborative object detection methods and remains robust even under severe noise in agent pose and delay information.
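The core mechanism described above, forward-diffusing a noisy fused latent and then running condition-guided reverse sampling toward a clean estimate, can be illustrated with a minimal NumPy sketch. Everything here is a hypothetical stand-in: the orthogonal projection `W` replaces the pre-trained autoencoder, and the epsilon predictor is an oracle that treats the agents' conditioning latent `z_cond` as its clean-latent estimate, in place of CoDiff's learned, condition-guided network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pre-trained autoencoder": a random orthonormal projection (hypothetical
# stand-in for the real encoder/decoder).
D, d = 64, 8                                   # feature dim, latent dim
W, _ = np.linalg.qr(rng.normal(size=(D, d)))   # columns are orthonormal: W.T @ W = I

def encode(f):
    return f @ W        # project a high-dimensional feature into the latent space

def decode(z):
    return z @ W.T      # map a latent back to feature space

# A clean fused feature and its latent.
f_clean = rng.normal(size=D)
z_clean = encode(f_clean)

# Standard DDPM noise schedule.
T = 50
betas = np.linspace(1e-4, 0.05, T)
abar = np.cumprod(1.0 - betas)

# Forward-diffuse the clean latent to the fully noised state z_T (this plays the
# role of the pose/delay-corrupted fused feature).
z_T = np.sqrt(abar[-1]) * z_clean + np.sqrt(1 - abar[-1]) * rng.normal(size=d)

# Deterministic (DDIM-style) reverse sampling, guided by the agents' shared
# information. Here the conditioning latent is simply the clean latent -- an
# oracle stand-in for the learned, agent-conditioned epsilon predictor.
z_cond = z_clean
z = z_T
for t in range(T - 1, 0, -1):
    eps = (z - np.sqrt(abar[t]) * z_cond) / np.sqrt(1 - abar[t])   # predicted noise
    z0_hat = (z - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])   # implied clean latent
    z = np.sqrt(abar[t - 1]) * z0_hat + np.sqrt(1 - abar[t - 1]) * eps

f_denoised = decode(z)  # refined fused feature, handed to the detection head
```

In this sketch the reverse trajectory pulls the noisy latent back toward the conditioning signal, mirroring how CoDiff's guided sampling is meant to strip pose and delay noise from the fused representation before detection.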