In this paper, we propose an efficient model for 3D visual grounding. Existing methods use separate encoders for RGB images, text, and 3D point clouds, which results in complex models and inefficient training. To address this, we integrate the three modalities with a single 2D pre-trained multi-modal network: we adapt the 2D CLIP model to the tri-modal setting through adapter-based fine-tuning, and fuse multi-scale point-cloud and image features with a geometry-aware 2D-3D feature recovery and fusion (GARF) module. Text features are then integrated in a final modality fusion step, and a multi-modal decoder enables deep cross-modal understanding. As a result, our model improves 3D detection by 6.52% and 3D visual grounding by 6.25% while reducing the number of parameters by about 58%.
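To make the adapter-based fine-tuning and fusion idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the bottleneck width, the way adapters attach to frozen CLIP blocks, and the cross-attention decoder are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter added after a frozen transformer block.

    Only these small layers are trained; the pre-trained 2D weights stay
    frozen, which is what keeps the trainable parameter count low.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection


class AdaptedBlock(nn.Module):
    """Wraps a frozen pre-trained block with a trainable adapter."""
    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.adapter = Adapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))


class CrossModalDecoder(nn.Module):
    """Toy multi-modal decoder: text queries cross-attend to fused
    point/image tokens (a stand-in for the paper's decoder)."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, text_tokens, fused_tokens):
        return self.decoder(tgt=text_tokens, memory=fused_tokens)


if __name__ == "__main__":
    dim = 512
    # Stand-in for one frozen 2D transformer block (e.g., a CLIP layer).
    frozen = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    image_branch = AdaptedBlock(frozen, dim)

    img_tokens = torch.randn(2, 196, dim)    # image patch tokens
    pts_tokens = torch.randn(2, 1024, dim)   # point-cloud tokens (already embedded)
    txt_tokens = torch.randn(2, 32, dim)     # text tokens

    # Concatenation stands in for the 2D-3D fusion step of the pipeline.
    fused = torch.cat([image_branch(img_tokens), pts_tokens], dim=1)
    out = CrossModalDecoder(dim)(txt_tokens, fused)
    print(out.shape)  # torch.Size([2, 32, 512])
```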