Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

Created by
  • Haebom

Author

Jiawen Lin, Shiran Bian, Yihang Zhu, Wenbin Tan, Yachao Zhang, Yuan Xie, Yanyun Qu

Outline

Unlike supervised methods, which achieve high accuracy only within the limited environments they are trained on, this paper targets zero-shot 3D visual grounding (3DVG), which is better suited to real-world deployment. To address the weak spatial reasoning of existing zero-shot methods and their tendency to omit context or lose detail, the authors propose SeqVLM, a novel zero-shot 3DVG framework that exploits multi-view real-world scene images together with spatial information. SeqVLM first generates 3D instance proposals with a 3D semantic segmentation network and refines them via semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidates onto sequences of real scene images, preserving spatial relationships and contextual detail during the 3D point cloud-to-image conversion. To curb the computational load on the VLM, a dynamic scheduling mechanism iteratively processes sequence-query prompts, exploiting the VLM's cross-modal reasoning to identify the object specified in the text. On the ScanRefer and Nr3D benchmarks, SeqVLM achieves state-of-the-art zero-shot performance, with Acc@0.25 scores of 55.6% and 53.2%, surpassing existing zero-shot methods by 4.0% and 5.2%, respectively. The code is available at https://github.com/JiawLin/SeqVLM .
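The pipeline described above can be sketched roughly as follows. This is a minimal illustrative stand-in, not the authors' implementation: all names (`Proposal`, `semantic_filter`, `schedule_batches`, `vlm_select`) are hypothetical, and both the 3D segmentation network and the VLM are replaced by trivial stubs.

```python
# Hypothetical sketch of a SeqVLM-style zero-shot 3DVG pipeline.
# The real system uses a 3D semantic segmentation network for proposals
# and a VLM for cross-modal reasoning; both are stubbed out here.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Proposal:
    instance_id: int
    label: str          # semantic class from the (stubbed) segmentation network
    views: List[str]    # multi-view scene images this proposal projects into

def semantic_filter(proposals: List[Proposal], query: str) -> List[Proposal]:
    # Crude stand-in for semantic filtering: keep proposals whose
    # class name appears in the text query.
    return [p for p in proposals if p.label in query.lower()]

def schedule_batches(proposals: List[Proposal], batch_size: int = 2):
    # Dynamic-scheduling stand-in: split candidates into small batches so
    # each VLM call sees a bounded sequence-query prompt.
    return [proposals[i:i + batch_size]
            for i in range(0, len(proposals), batch_size)]

def vlm_select(batch: List[Proposal], query: str) -> Proposal:
    # Placeholder for the VLM's cross-modal reasoning: score candidates by
    # how many projected views support them (more context -> higher score).
    return max(batch, key=lambda p: len(p.views))

def ground(proposals: List[Proposal], query: str) -> Optional[Proposal]:
    candidates = semantic_filter(proposals, query)
    best: Optional[Proposal] = None
    # Iteratively query the (stubbed) VLM batch by batch, keeping the best.
    for batch in schedule_batches(candidates):
        winner = vlm_select(batch, query)
        if best is None or len(winner.views) > len(best.views):
            best = winner
    return best

if __name__ == "__main__":
    scene = [
        Proposal(0, "chair", ["v1"]),
        Proposal(1, "table", ["v1", "v2"]),
        Proposal(2, "chair", ["v1", "v2", "v3"]),
    ]
    result = ground(scene, "the chair next to the window")
    print(result.instance_id)  # selects the chair with the most supporting views
```

The point of the batching loop is the trade-off the paper highlights: rather than sending every candidate to the VLM at once, candidates are processed in bounded chunks, keeping each prompt small.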

Takeaways, Limitations

Takeaways:
  • Improves zero-shot 3DVG performance and generalizability by leveraging multi-view images and spatial information.
  • Reduces the VLM's computational load through a dynamic scheduling mechanism.
  • Achieves state-of-the-art performance on the ScanRefer and Nr3D benchmarks.
  • Improves applicability to real-world settings.
Limitations:
  • Performance may depend on the underlying 3D semantic segmentation network and VLM.
  • Multi-view image processing may increase computational cost.
  • Performance may degrade for certain types of scenes or objects.
  • Generalization to diverse environments requires further study.