In this paper, we propose ReconX, a novel method for high-quality 3D scene reconstruction from sparse input images. Existing 3D scene reconstruction methods suffer from artifacts and distortions when viewpoint information is insufficient; to address this, ReconX reframes the sparse-view reconstruction problem as a temporal generation task that leverages the strong generative priors of pre-trained video diffusion models. Specifically, ReconX first builds a global point cloud from the input views and encodes it into a contextual representation that serves as a 3D structure condition; guided by this condition, the video diffusion model synthesizes video frames that preserve detail and exhibit high 3D consistency. Finally, the 3D scene is recovered from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Experimental results show that ReconX surpasses existing state-of-the-art methods in both reconstruction quality and generalization capability.
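To make the three-stage flow concrete, the following is a minimal Python sketch of the pipeline described above: lift sparse views to a global point cloud, encode it as a structure condition for a video diffusion model, and fit the scene with a confidence-weighted Gaussian Splatting step. All names here (`build_global_point_cloud`, `ContextEncoder`, `VideoDiffusion`, `optimize_gaussians`) are hypothetical placeholders standing in for the actual components, not the authors' released API.

```python
import torch

def build_global_point_cloud(views: torch.Tensor) -> torch.Tensor:
    """Stub: lift sparse input views (N, 3, H, W) to a global point cloud of
    xyz+rgb points, e.g. via an unposed stereo reconstruction model."""
    n, _, h, w = views.shape
    return torch.rand(n * (h // 8) * (w // 8), 6)  # placeholder geometry

class ContextEncoder(torch.nn.Module):
    """Stub: map the point cloud to a contextual embedding used as the
    3D structure condition for the video diffusion model."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(6, dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.proj(points).mean(dim=0, keepdim=True)  # (1, dim)

class VideoDiffusion(torch.nn.Module):
    """Stub standing in for a pre-trained video diffusion model; it would
    denoise a frame sequence conditioned on the structure embedding."""
    def sample(self, condition: torch.Tensor, num_frames: int, hw: tuple) -> torch.Tensor:
        h, w = hw
        return torch.rand(num_frames, 3, h, w)  # placeholder frames

def optimize_gaussians(frames: torch.Tensor, confidence: torch.Tensor) -> dict:
    """Stub: confidence-aware 3D Gaussian Splatting optimization that
    down-weights less 3D-consistent generated content."""
    weights = confidence / confidence.sum()
    return {"num_frames": frames.shape[0], "mean_weight": weights.mean().item()}

# End-to-end flow: sparse views -> point cloud -> condition -> video -> 3D scene.
views = torch.rand(2, 3, 256, 256)                # two input images
points = build_global_point_cloud(views)
condition = ContextEncoder()(points)
frames = VideoDiffusion().sample(condition, num_frames=16, hw=(256, 256))
confidence = torch.rand(frames.shape[0])          # per-frame confidence (placeholder)
scene = optimize_gaussians(frames, confidence)
print(scene)
```

In a real implementation, the stubs would be replaced by the actual point-cloud builder, the conditioned video diffusion sampler, and a per-pixel (rather than per-frame) confidence map feeding the 3DGS optimization; the sketch only illustrates how the stages compose.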