Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Created by
  • Haebom

Authors

Fan Li, Zanyi Wang, Zeyi Huang, Guang Dai, Jingdong Wang, Mengmeng Wang

Outline

This paper proposes an efficient model for 3D visual grounding. Existing methods use separate encoders for RGB images, text, and 3D point clouds, which makes the models large and complex and training inefficient. Instead, this work integrates all three modalities on top of a single pre-trained 2D multimodal network: adapter-based fine-tuning adapts the 2D CLIP model to the tri-modal setting, and a Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module fuses multi-scale geometric features from the point cloud and the images. Text features are then integrated for the final modality fusion, and a multimodal decoder enables deep cross-modal understanding. As a result, the method improves performance by 6.52% on 3D detection and 6.25% on 3D visual grounding while reducing the number of parameters by approximately 58%.
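
For readers unfamiliar with adapter-based fine-tuning, the sketch below illustrates the general idea in PyTorch. This is a minimal illustration, not the paper's code: the bottleneck design, dimensions, and insertion points are assumptions, and TriCLIP-3D's actual adapters (including how they handle point-cloud inputs) are described in the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter placed alongside a frozen transformer block.

    Only these small layers are trained; the pre-trained CLIP weights
    stay frozen, which is how adapter-based fine-tuning keeps the
    trainable parameter count low.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # compress features
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # expand back
        # Zero-init the up projection so the adapter starts as an
        # exact identity map and training perturbs the frozen
        # features only gradually (a common adapter convention).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the frozen features pass through
        # unchanged, plus a small learned task-specific correction.
        return x + self.up(self.act(self.down(x)))

if __name__ == "__main__":
    # Toy check on random "token" features of width 768 (a common
    # CLIP width; the paper's actual dimensions may differ).
    x = torch.randn(4, 50, 768)   # batch of 4, 50 tokens each
    print(Adapter(768)(x).shape)  # torch.Size([4, 50, 768])
```

In a full pipeline one would freeze the CLIP backbone and train only the adapters (plus any fusion and decoder heads), which is consistent with the roughly 58% parameter reduction reported above.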

Takeaways and Limitations

Takeaways:
Significantly improves the efficiency of 3D visual grounding (fewer parameters, higher performance).
Reduces model complexity by building on a pre-trained 2D multimodal network.
The GARF module effectively fuses geometric features of point clouds and images.
Provides an end-to-end 3D visual grounding model.
Limitations:
Further research is needed to determine whether the method generalizes across all types of 3D visual grounding tasks.
It remains to be verified whether the gains observed on the evaluated datasets transfer equally well to other datasets.
The approach depends on the 2D CLIP model, so CLIP's own limitations may carry over to this model's performance.