
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop

Created by
  • Haebom

Authors

Atharv Goel, Mehar Khurana

Outline

In this paper, we present an annotation-free method for 3D object detection that leverages 2D vision-language models trained on web-scale image-text pairs. It addresses two limitations of existing 3D object detection datasets: narrow class taxonomies and expensive manual annotation. We generate text-conditioned proposals with a 2D vision-language detector, segment them with SAM, and back-project the resulting masks into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We then infer 3D bounding boxes without any training, using a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers. To simulate adverse real-world conditions, we also construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset. Experiments show that the method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, while requiring no training and supporting an open vocabulary.
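
The geometric part of this pipeline (lifting a segmentation mask to 3D, clustering, and fitting an oriented box) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the `eps` and `min_samples` values are arbitrary, the camera frame is assumed to be x right, y down, z forward, and the paper's inflation step is not reproduced because the summary does not describe it. To stay dependency-free, the sketch uses a hand-written O(n^2) DBSCAN and finds the minimum-area rectangle by brute force over convex-hull edges, which relies on the same property as Rotating Calipers (one side of the optimal rectangle lies along a hull edge) without the linear-time sweep.

```python
import numpy as np

def backproject(mask, depth, K):
    """Lift masked pixels with valid depth to camera-frame 3D points (pinhole model)."""
    v, u = np.nonzero(mask.astype(bool) & (depth > 0))  # v = rows, u = columns
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def dbscan(points, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one label per point, -1 meaning noise."""
    dist = np.linalg.norm(points[:, None] - points[None], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = np.full(len(points), -1)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not core[j]:
                continue  # border points join a cluster but do not expand it
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = np.unique(pts, axis=0)  # also sorts rows lexicographically
    if len(pts) < 3:
        return pts
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2:
                a, b = out[-1] - out[-2], p - out[-1]
                if a[0] * b[1] - a[1] * b[0] > 0:
                    break
                out.pop()
            out.append(p)
        return out[:-1]
    return np.array(half(pts) + half(pts[::-1]))

def min_area_rect(pts_2d):
    """Try each hull edge as a rectangle side; return (center, size, angle) of the smallest."""
    hull = convex_hull(pts_2d)
    best = None
    for i in range(len(hull)):
        e = hull[(i + 1) % len(hull)] - hull[i]
        theta = np.arctan2(e[1], e[0])
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, s], [-s, c]])  # rotates this edge onto the +x axis
        rot = hull @ R.T
        lo, hi = rot.min(axis=0), rot.max(axis=0)
        area = np.prod(hi - lo)
        if best is None or area < best[0]:
            best = (area, R.T @ ((lo + hi) / 2), hi - lo, theta)
    return best[1:]

def fit_box(points, eps=0.75, min_samples=5):
    """Keep the largest DBSCAN cluster and fit a yaw-oriented box in the x-z plane."""
    labels = dbscan(points, eps, min_samples)
    if labels.max() < 0:
        return None  # every point was classified as noise
    main = points[labels == np.bincount(labels[labels >= 0]).argmax()]
    center, size, yaw = min_area_rect(main[:, [0, 2]])
    y_lo, y_hi = main[:, 1].min(), main[:, 1].max()
    return {
        "center": (center[0], (y_lo + y_hi) / 2, center[1]),
        "size": (size[0], y_hi - y_lo, size[1]),
        "yaw": yaw,  # angle of the box's first side in the x-z plane, radians
    }
```

In use, `backproject` would be called once per segmented proposal and its output passed to `fit_box`. With LiDAR input, the back-projection step would presumably be replaced by selecting the LiDAR points that project inside the mask. Clustering before box fitting matters because mask edges often pick up background depth values, and a single far outlier would otherwise stretch the fitted box.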

Takeaways, Limitations

Takeaways:
We demonstrate that open-vocabulary 3D object detection is possible without any training by reusing a 2D vision-language model.
We suggest that leveraging web-scale data can make 3D object detection more scalable.
We present a general methodology applicable to both LiDAR and RGB-D inputs.
We simulate adverse real-world conditions and evaluate performance under them using the Pseudo-nuScenes dataset.
We support reproducibility and follow-up research by making the code and resources openly available.
Limitations:
Because the method depends on the 2D model, any degradation in the 2D model's performance directly affects 3D detection performance.
Because it relies on camera geometry and depth information, errors in calibration or in the depth source (particularly monocular pseudo-depth) may make it difficult to recover accurate 3D information.
The Pseudo-nuScenes dataset uses simulated fog and may not fully reflect real environments.
The accuracy of the 3D bounding boxes depends on the accuracy of the geometric inflation strategy.