This paper proposes an efficient model for 3D visual grounding. Existing methods use separate encoders for RGB images, text, and 3D point clouds, which leads to large, complex models and inefficient training. In this paper, we propose a method that integrates all three modalities by leveraging a pre-trained 2D multimodal network. We apply adapter-based fine-tuning to the 2D CLIP model so that it adapts effectively to the trimodal setting, and our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module fuses multi-scale geometric features of point clouds with image features. Text features are then integrated in a final modality-fusion step, and a multimodal decoder enables deep cross-modal understanding. As a result, our model achieves performance improvements of 6.52% on 3D detection and 6.25% on 3D visual grounding, while reducing the number of parameters by approximately 58%.
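
To make the described pipeline concrete, below is a minimal PyTorch sketch of how the components could fit together. All module internals, names (`Adapter`, `GARF`, `TrimodalGrounder`), feature dimensions, and the cross-attention used as a stand-in for GARF are illustrative assumptions; the abstract does not specify these details, so this is not the authors' implementation.

```python
# Hypothetical sketch of the trimodal pipeline summarized in the abstract.
# Every module body, dimension, and hyperparameter here is a placeholder.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter appended to a frozen CLIP block (adapter-based fine-tuning)."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual adapter


class GARF(nn.Module):
    """Placeholder for the Geometric-Aware 2D-3D Feature Recovery and Fusion module:
    fuses point-cloud features with 2D image features (cross-attention stand-in)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, image_feats):
        fused, _ = self.attn(point_feats, image_feats, image_feats)
        return self.norm(point_feats + fused)


class TrimodalGrounder(nn.Module):
    def __init__(self, dim=256, num_queries=128):
        super().__init__()
        self.num_queries = num_queries
        # Stand-ins for CLIP image/text encoders extended with adapters.
        self.image_encoder = nn.Sequential(nn.Linear(768, dim), Adapter(dim))
        self.text_encoder = nn.Sequential(nn.Linear(512, dim), Adapter(dim))
        self.point_encoder = nn.Linear(6, dim)    # xyz + rgb per point (placeholder)
        self.garf = GARF(dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.box_head = nn.Linear(dim, 7)         # center, size, heading (placeholder)

    def forward(self, points, image_tokens, text_tokens):
        pts = self.point_encoder(points)          # (B, Np, dim)
        img = self.image_encoder(image_tokens)    # (B, Ni, dim)
        txt = self.text_encoder(text_tokens)      # (B, Nt, dim)
        fused = self.garf(pts, img)               # 2D-3D feature fusion
        memory = torch.cat([fused, txt], dim=1)   # final modality fusion with text
        queries = fused[:, :self.num_queries]     # object queries (placeholder choice)
        out = self.decoder(queries, memory)       # multimodal decoder
        return self.box_head(out)                 # grounded 3D boxes


model = TrimodalGrounder()
boxes = model(torch.randn(2, 1024, 6), torch.randn(2, 196, 768), torch.randn(2, 32, 512))
print(boxes.shape)  # torch.Size([2, 128, 7])
```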