Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D

Created by
  • Haebom

Authors

Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei, Babak Khalaj

Outline

3D scene understanding is a core competency for embodied AI and robotics, supporting reliable perception for interaction and exploration. This paper performs zero-shot, open-vocabulary 3D semantic mapping by assigning Vision-Language Model (VLM) embedding vectors to class-agnostic 2D masks and projecting them into 3D. The method uses SemanticSAM with progressive granularity refinement to generate more accurate object-level masks, mitigating the oversegmentation commonly observed in mask-generation models. In addition, a context-aware CLIP encoding strategy integrates multiple contextual views of each mask to enrich its visual context. The approach is evaluated on a range of 3D scene understanding tasks and shows significant improvements over existing methods.
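To make the context-aware encoding step concrete, the sketch below shows one way the idea could be implemented: crop each mask's bounding box at several expansion scales, encode the crops with CLIP, and fuse the normalized embeddings by averaging. This is an illustrative reconstruction, not the authors' code; the crop scales, the mean fusion, and the helper names (`contextual_views`, `mask_embedding`) are assumptions.

```python
# Illustrative sketch of context-aware CLIP encoding (not the authors' code).
# Assumptions: the crop scales (1.0, 1.5, 2.0), mean fusion of normalized
# embeddings, and the helper names are all hypothetical choices.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def contextual_views(image: Image.Image, mask: np.ndarray,
                     scales=(1.0, 1.5, 2.0)) -> list[Image.Image]:
    """Crop the mask's bounding box at several expansion scales,
    so wider crops carry progressively more surrounding context."""
    ys, xs = np.nonzero(mask)
    cx, cy = (xs.min() + xs.max()) / 2, (ys.min() + ys.max()) / 2
    w, h = xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
    views = []
    for s in scales:
        box = (max(0, int(cx - s * w / 2)), max(0, int(cy - s * h / 2)),
               min(image.width, int(cx + s * w / 2)),
               min(image.height, int(cy + s * h / 2)))
        views.append(image.crop(box))
    return views

@torch.no_grad()
def mask_embedding(image: Image.Image, mask: np.ndarray) -> torch.Tensor:
    """Encode each contextual view with CLIP and fuse by averaging."""
    inputs = processor(images=contextual_views(image, mask), return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (num_views, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize each view
    return feats.mean(dim=0)                          # fused mask embedding
```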

Takeaways, Limitations

Takeaways:
  • Mitigates oversegmentation by leveraging SemanticSAM with progressive granularity refinement, yielding improved object-level masks.
  • Provides rich visual context through a context-aware CLIP encoding strategy.
  • Achieves significant performance improvements over existing methods on several 3D scene understanding tasks, such as 3D semantic segmentation and language-query-based object retrieval (see the retrieval sketch after this list).
Limitations:
  • The paper states no specific limitations.
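As referenced in the takeaways above, language-query-based retrieval reduces to a nearest-neighbor search in CLIP space once mask embeddings have been aggregated into 3D. Below is a minimal sketch, assuming a precomputed `object_embeddings` tensor with one embedding per 3D instance and a generic prompt template; both are assumptions, not the paper's exact setup.

```python
# Illustrative sketch of open-vocabulary retrieval in CLIP space
# (not the authors' code). `object_embeddings` is assumed to hold one
# fused embedding per 3D instance, produced by projecting 2D mask
# embeddings into 3D; the prompt template is also an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def query_objects(object_embeddings: torch.Tensor, query: str, top_k: int = 5):
    """Rank 3D instances against a free-form text query by cosine similarity."""
    inputs = processor(text=[f"a photo of a {query}"],
                       return_tensors="pt", padding=True)
    text = model.get_text_features(**inputs)
    text = text / text.norm(dim=-1, keepdim=True)
    objs = object_embeddings / object_embeddings.norm(dim=-1, keepdim=True)
    scores = objs @ text.squeeze(0)   # cosine similarity per instance
    return scores.topk(min(top_k, scores.numel()))  # best-matching instances
```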