In this paper, we propose ReconX, a novel method for accurate 3D scene reconstruction from a limited number of viewpoint images. Unlike existing dense-view reconstruction methods, ReconX reformulates the ambiguous sparse-view reconstruction problem as a temporal generation task by leveraging the powerful generative prior of pre-trained video diffusion models. First, we build a global point cloud from the limited input views and encode it into a 3D structure condition. Guided by this condition, the video diffusion model synthesizes video frames that preserve detail while maintaining high 3D consistency. Finally, we reconstruct the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Experimental results show that ReconX outperforms existing state-of-the-art methods in both reconstruction quality and generalization capability.
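To make the final step concrete, the sketch below illustrates one plausible form of a confidence-aware photometric objective for the 3D Gaussian Splatting stage: each pixel's loss is weighted by how confident we are in the corresponding generated frame, so uncertain regions of the synthesized video contribute less to the optimization. This is a minimal illustration, not the paper's implementation; the function name `confidence_weighted_l1` and the exact weighting scheme are assumptions made for exposition.

```python
import torch

def confidence_weighted_l1(rendered: torch.Tensor,
                           generated: torch.Tensor,
                           confidence: torch.Tensor) -> torch.Tensor:
    """L1 photometric loss weighted by per-pixel confidence.

    rendered:   image rendered from the current 3D Gaussians, (B, 3, H, W)
    generated:  target frame synthesized by the video diffusion model, (B, 3, H, W)
    confidence: per-pixel confidence in [0, 1], (B, 1, H, W); low values
                down-weight regions where the generated frame is unreliable
                (hypothetical form of the confidence map, for illustration)
    """
    return (confidence * (rendered - generated).abs()).mean()

# Toy usage with random tensors standing in for real frames.
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
generated = torch.rand(1, 3, 64, 64)
confidence = torch.rand(1, 1, 64, 64)

loss = confidence_weighted_l1(rendered, generated, confidence)
loss.backward()  # gradients flow only through the rendered image
```

Weighting the residual rather than masking it keeps the objective differentiable everywhere while still letting high-confidence regions dominate the Gaussian updates.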