To address the challenges of generating sparse, multi-category 3D voxel structures, this paper proposes a novel generative model called Scaffold Diffusion. This method treats voxels as tokens and generates 3D voxel structures using a discrete diffusion language model. Unlike existing autoregressive methods, this model can generate realistic and consistent structures even with data sparsity exceeding 98%. Experimentally, we demonstrate this using Minecraft house structure data from the 3D-Craft dataset. Furthermore, we provide an interactive viewer that visualizes the generated samples and the generation process. Our findings highlight the promise of the discrete diffusion model as a promising framework for generative modeling of 3D sparse voxels.