In this paper, we propose a novel method, the Video-3D Geometry Large Language Model (VG LLM), which enhances the ability of multimodal large language models (MLLMs) to understand and reason about 3D space from video alone, without any additional 3D data input. VG LLM employs a 3D visual geometry encoder to extract 3D geometric priors from video sequences, fuses them with the visual tokens, and feeds the combined representation to the MLLM. Experimental results show that the proposed method significantly improves performance on a range of 3D scene understanding and spatial reasoning tasks. In particular, our 4B model, which does not rely on explicit 3D data input, achieves results competitive with state-of-the-art methods and outperforms Gemini-1.5-Pro on the VSI-Bench evaluation.
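To make the described data flow concrete, below is a minimal PyTorch-style sketch of the encode-fuse-feed pipeline. All module names (`VisualEncoder`, `GeometryEncoder`, the projection layers) and the additive fusion are hypothetical placeholders chosen for illustration, not the paper's released implementation; the actual encoders and fusion scheme may differ.

```python
import torch
import torch.nn as nn


class VGLLMFusionSketch(nn.Module):
    """Illustrative sketch: fuse 3D geometry priors with visual tokens
    before handing the token sequence to an MLLM (assumed interfaces)."""

    def __init__(self, patch_dim=588, vis_dim=1024, geo_dim=768, llm_dim=2048):
        super().__init__()
        # Stand-in for a 2D visual encoder producing per-frame visual tokens.
        self.visual_encoder = nn.Linear(patch_dim, vis_dim)
        # Stand-in for the 3D visual geometry encoder extracting
        # geometric priors from the same video patches.
        self.geometry_encoder = nn.Linear(patch_dim, geo_dim)
        # Project both token streams into the LLM embedding space.
        self.vis_proj = nn.Linear(vis_dim, llm_dim)
        self.geo_proj = nn.Linear(geo_dim, llm_dim)

    def forward(self, video_patches: torch.Tensor) -> torch.Tensor:
        # video_patches: (num_frames, num_patches, patch_dim)
        vis_tokens = self.vis_proj(self.visual_encoder(video_patches))
        geo_tokens = self.geo_proj(self.geometry_encoder(video_patches))
        # Element-wise addition is one plausible fusion choice; the
        # paper's exact integration mechanism may differ.
        fused = vis_tokens + geo_tokens
        # Flatten frames x patches into one token sequence for the MLLM.
        return fused.flatten(0, 1)


# Usage: 8 frames, 196 patches each, fed as a single fused token sequence.
tokens = VGLLMFusionSketch()(torch.randn(8, 196, 588))
print(tokens.shape)  # torch.Size([1568, 2048])
```

The key design point this sketch highlights is that geometry is injected on the input side, as extra structure in the visual tokens, so the MLLM itself never needs explicit 3D data (e.g., depth maps or point clouds) as input.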