
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Created by
  • Haebom

Author

Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang

Outline

In this paper, we propose an efficient model for 3D visual grounding. Existing methods use separate encoders for RGB images, text, and 3D point clouds, which makes models complex and training inefficient. To address this, we integrate the three modalities with a single 2D pre-trained multi-modal network: the 2D CLIP model is adapted to the tri-modal setting through adapter-based fine-tuning, and multi-scale features of point clouds and images are fused via a Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module. Text features are then integrated for the final modality fusion, and a multi-modal decoder enables deep cross-modal understanding. As a result, the model achieves performance improvements of 6.52% in 3D detection and 6.25% in 3D visual grounding while reducing the number of parameters by about 58%.
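The adapter-based fine-tuning mentioned above typically inserts a small bottleneck module with a residual connection next to each frozen backbone layer, so only the adapter weights are trained. A minimal sketch of that idea (the dimensions, zero up-projection init, and the plain-NumPy stand-in for CLIP layers are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.

    Hypothetical sketch of a generic adapter module; not TriCLIP-3D's actual code.
    """

    def __init__(self, dim: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Small random down-projection; zero-initialized up-projection so the
        # adapter starts as an identity map and does not disturb the frozen
        # backbone features at the beginning of training.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual keeps backbone features

# Usage: wrap frozen CLIP features (here, dummy 512-dim tokens).
adapter = Adapter(dim=512, bottleneck=64)
x = np.ones((2, 512))
y = adapter(x)
# At initialization the zero up-projection makes the adapter an identity,
# so y equals x; only w_down/w_up would be updated during fine-tuning.
```

Because only the small `w_down`/`w_up` matrices are trainable while the backbone stays frozen, this style of tuning is what makes large cuts in trainable parameters (such as the ~58% reported here) possible.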

Takeaways, Limitations

Takeaways:
Significantly improved the efficiency of 3D visual grounding models (fewer parameters, better performance).
Reduced model complexity by leveraging a single 2D pre-trained multi-modal network instead of separate per-modality encoders.
Effectively fused the geometric features of images and point clouds using the GARF module.
Achieved strong performance on 3D visual grounding.
Limitations:
Further studies are needed to determine whether the proposed method generalizes to all types of 3D visual grounding problems.
The dependency on a specific 2D CLIP model may be a limitation; applicability to other 2D models needs to be evaluated.
Robustness in real-world environments has not been evaluated.
Performance evaluation on large-scale datasets is still required.