This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Supervised methods achieve high accuracy, but only in restricted settings. This paper instead focuses on zero-shot 3D visual grounding (3DVG), which is better suited to real-world applications. To address the spatial reasoning limitations of existing zero-shot methods, along with their tendency to omit context or lose detail, the authors propose SeqVLM, a zero-shot 3DVG framework that leverages multi-view real-world scene images together with spatial information. SeqVLM generates 3D instance proposals with a 3D semantic segmentation network and then applies semantic filtering so that only semantically relevant candidates remain. A proposal-guided multi-view projection strategy projects these candidates onto real-world scene image sequences, preserving spatial relationships and contextual details during the conversion from 3D point cloud to images. To reduce the computational load on the vision-language model (VLM), a dynamic scheduling mechanism iteratively processes sequence-query prompts, using the VLM's cross-modal reasoning to identify the object specified by the text. Experiments on the ScanRefer and Nr3D benchmarks show state-of-the-art performance, with Acc@0.25 scores of 55.6% and 53.2%, respectively, surpassing existing zero-shot methods by 4.0% and 5.2%. The code is available at https://github.com/JiawLin/SeqVLM
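The projection step described above can be pictured with a standard pinhole camera model. The following is a minimal sketch, not the authors' implementation: the function names, the 4x4 world-to-camera matrix, the 3x3 intrinsics matrix, and the idea of summarizing a proposal as a 2D bounding box plus a visible fraction are all assumptions made for illustration.

```python
import numpy as np


def project_points(points_world, world_to_cam, intrinsics):
    """Project Nx3 world points to pixel coordinates with a pinhole model.

    Points at or behind the camera plane are dropped, so the result
    may contain fewer than N rows.
    """
    ones = np.ones((len(points_world), 1))
    homogeneous = np.hstack([points_world, ones])          # N x 4
    cam = (world_to_cam @ homogeneous.T).T[:, :3]          # N x 3, camera frame
    cam = cam[cam[:, 2] > 1e-6]                            # keep points in front
    uvw = (intrinsics @ cam.T).T                           # N' x 3
    return uvw[:, :2] / uvw[:, 2:3]                        # perspective divide


def proposal_bbox_in_view(points_world, world_to_cam, intrinsics, width, height):
    """Return ((x0, y0, x1, y1), visible_fraction) for one proposal in one view.

    Hypothetical helper: the box encloses the projected points that land
    inside the image, and visible_fraction is the share of the proposal's
    points that do so. Returns (None, 0.0) if nothing is visible.
    """
    pixels = project_points(points_world, world_to_cam, intrinsics)
    inside = (
        (pixels[:, 0] >= 0) & (pixels[:, 0] < width)
        & (pixels[:, 1] >= 0) & (pixels[:, 1] < height)
    )
    if not inside.any():
        return None, 0.0
    visible = pixels[inside]
    x0, y0 = visible.min(axis=0)
    x1, y1 = visible.max(axis=0)
    fraction = float(inside.sum()) / len(points_world)
    return (float(x0), float(y0), float(x1), float(y1)), fraction
```

Under these assumptions, calling `proposal_bbox_in_view` once per camera frame and ranking frames by the visible fraction would be one way to pick the views in which a candidate is best seen. How SeqVLM actually selects and crops views is not stated in this summary.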
Takeaways, Limitations
• Takeaways:
◦ Improves zero-shot 3DVG performance and generalizability by leveraging multi-view images and spatial information.
◦ Reduces the VLM's computational load through a dynamic scheduling mechanism.
◦ Achieves state-of-the-art performance on the ScanRefer and Nr3D benchmarks.
◦ Increases real-world applicability.
• Limitations:
◦ Performance may depend on the quality of the 3D semantic segmentation network and the VLM used.
◦ Multi-view image processing may increase computational cost.
◦ Performance may degrade for certain types of scenes or objects.
◦ Further research is needed on generalization across diverse environments.
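The dynamic scheduling idea mentioned in the takeaways can be illustrated with a small sketch. The summary does not spell out the mechanism, so this is only one plausible reading: candidates are split into batches, a VLM picks at most one candidate per batch, and the survivors are re-batched until a single candidate remains. The function `select_by_rounds`, the `ask_vlm` callback, and the `batch_size` parameter are hypothetical names, and a plain Python function stands in for the real model.

```python
from typing import Callable, Hashable, List, Optional, Sequence


def select_by_rounds(
    candidate_ids: Sequence[Hashable],
    query: str,
    ask_vlm: Callable[[str, List[Hashable]], Optional[Hashable]],
    batch_size: int = 4,
) -> Optional[Hashable]:
    """Narrow candidates to one by querying a VLM stand-in in batches.

    ask_vlm(query, batch) is a hypothetical callback that returns the id
    of the best match in the batch, or None if nothing in it matches.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    remaining = list(candidate_ids)
    while len(remaining) > 1:
        survivors = []
        for start in range(0, len(remaining), batch_size):
            batch = remaining[start:start + batch_size]
            choice = ask_vlm(query, batch)
            if choice in batch:  # ignore None or ids outside the batch
                survivors.append(choice)
        if not survivors:
            return None  # the stand-in rejected every batch
        remaining = survivors  # at most one survivor per batch, so this shrinks
    return remaining[0] if remaining else None
```

With a batch size of at least 2, each round keeps at most one candidate per batch, so the loop always terminates. The design keeps every individual prompt short at the cost of extra rounds. Whether SeqVLM batches by candidate, by image sequence, or in some other way is an open assumption here.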