This paper addresses the problem of generating large-scale scene data for robot learning. Existing neural reconstruction methods can rebuild large-scale outdoor scenes from real-world environments, but they are restricted to static environments and offer little diversity in scenes and trajectories. In contrast, recent image and video diffusion models offer controllability but lack geometric grounding and causality. To overcome these limitations, this study presents a method for directly generating large-scale 3D driving scenes with accurate geometry. The proposed method combines the generation of proxy geometry and environment representations with score distillation from learned 2D image priors, providing high controllability and enabling the generation of realistic, geometrically consistent, and complex 3D driving scenes conditioned on a map layout.
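
Since the core mechanism named above is score distillation from a learned 2D image prior, the sketch below illustrates how such an objective is commonly set up: a differentiable rendering of the 3D scene is noised, a frozen 2D diffusion prior predicts the noise, and the detached residual supplies a gradient that pulls the rendering toward the prior's modes. This is a minimal PyTorch illustration under assumed interfaces; `renderer`, `eps_model`, `scene_params`, and `alphas_cumprod` are hypothetical stand-ins rather than the paper's actual components, and the usual per-timestep weighting w(t) is omitted for brevity.

```python
import torch

def sds_loss(image, eps_model, alphas_cumprod):
    """Score-distillation loss on a rendered image (minimal sketch).

    image:          differentiable rendering of the 3D scene, (B, C, H, W)
    eps_model:      frozen 2D diffusion prior's noise predictor eps(x_t, t)
    alphas_cumprod: the prior's cumulative noise schedule, shape (T,)
    """
    t = torch.randint(1, len(alphas_cumprod), (1,), device=image.device)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(image)
    # Forward diffusion: noise the rendering to timestep t.
    noisy = a_t.sqrt() * image + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = eps_model(noisy, t)  # the 2D prior's noise estimate
    # Detached residual: the gradient w.r.t. the image is (eps_pred - noise),
    # i.e. the SDS gradient that flows back into the scene parameters only.
    return ((eps_pred - noise).detach() * image).sum()


# Toy usage with stand-in components (assumptions, for illustration only).
scene_params = torch.randn(1, 3, 64, 64, requires_grad=True)
renderer = lambda p: torch.sigmoid(p)       # stand-in differentiable renderer
eps_model = lambda x, t: torch.tanh(x)      # stand-in frozen 2D prior
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)

opt = torch.optim.Adam([scene_params], lr=1e-2)
for _ in range(10):
    opt.zero_grad()
    loss = sds_loss(renderer(scene_params), eps_model, alphas_cumprod)
    loss.backward()
    opt.step()
```

In an actual system of the kind the paper describes, the stand-in renderer would be replaced by a differentiable render of the proxy geometry and environment representation, so the 2D prior's guidance is distilled into a geometrically consistent 3D scene rather than a single image.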