Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Created by
  • Haebom

Authors

Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Outline

This paper proposes the Video-3D Geometry Large Language Model (VG LLM), a method that improves the ability of multimodal large language models (MLLMs) to understand and reason about 3D space from video alone, without any additional 3D data input. VG LLM uses a 3D visual geometry encoder to extract 3D prior information from video sequences, integrates that information with the visual tokens, and feeds the result to the MLLM. Experiments show that the method substantially improves performance on a range of 3D scene understanding and spatial reasoning tasks. In particular, the 4B model, which does not rely on explicit 3D data input, achieves results competitive with state-of-the-art methods and outperforms Gemini-1.5-Pro on the VSI-Bench evaluation.
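The pipeline described above (geometry features extracted per frame, combined with visual tokens, then passed to the MLLM) can be illustrated with a minimal sketch. The summary does not say how the two feature streams are combined, so the element-wise sum below is an assumption, and the names `fuse_tokens` and `build_mllm_input` are hypothetical and not taken from the paper. The encoders themselves are omitted; the sketch starts from their per-patch outputs.

```python
from typing import List

Token = List[float]  # one feature vector per image patch


def fuse_tokens(visual: List[Token], geometry: List[Token]) -> List[Token]:
    """Combine per-patch visual and geometry features for one frame.

    Assumption: fusion is an element-wise sum; the source only says the
    3D prior is "integrated" with the visual tokens.
    """
    if len(visual) != len(geometry):
        raise ValueError("visual and geometry token counts differ")
    fused = []
    for v, g in zip(visual, geometry):
        if len(v) != len(g):
            raise ValueError("visual and geometry feature sizes differ")
        fused.append([a + b for a, b in zip(v, g)])
    return fused


def build_mllm_input(
    frames_visual: List[List[Token]],
    frames_geometry: List[List[Token]],
) -> List[Token]:
    """Flatten the fused tokens of all frames into one input sequence."""
    sequence: List[Token] = []
    for vis, geo in zip(frames_visual, frames_geometry):
        sequence.extend(fuse_tokens(vis, geo))
    return sequence


if __name__ == "__main__":
    # Two frames, two patches per frame, two features per patch (toy values).
    vis = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 1.0]]]
    geo = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]]]
    print(build_mllm_input(vis, geo))
```

The point of the sketch is only that no point clouds or depth maps enter the pipeline: both feature streams are derived from the same video frames.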

Takeaways, Limitations

Takeaways:
The paper presents an efficient method for 3D spatial understanding and reasoning that uses only video data.
3D scene understanding tasks can be performed without 3D data preprocessing.
The method is competitive with existing state-of-the-art models and surpasses them on some benchmarks.
It opens new possibilities for MLLM-based 3D scene understanding research.
Limitations:
Further evaluation is needed of the method's generalization and of its performance across different video types.
The design and training process of the 3D visual geometry encoder are not described in detail.
The method depends on specific video datasets and may inherit their data bias.
A more detailed comparative analysis is needed to support the claim that the 4B model outperforms Gemini-1.5-Pro.