This paper analyzes the limitations of cross-modal contrastive distillation for 3D representation learning and proposes a novel framework, CMCR, to address them. To overcome the tendency of existing methods to focus solely on modality-shared features while overlooking modality-specific ones, we introduce masked image modeling and occupancy estimation tasks that encourage more comprehensive modality-specific feature learning. Furthermore, we propose a multi-modal unified codebook that learns a shared embedding space across modalities, together with geometry-enhanced masked image modeling, to further improve 3D representation learning. Experimental results demonstrate that CMCR outperforms existing image-to-LiDAR contrastive distillation methods on downstream tasks.
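
To illustrate the multi-modal unified codebook mentioned above, the following is a minimal sketch, assuming a single vector-quantization codebook shared by image and LiDAR features so that matched pairs can be aligned through common code assignments. The class name `UnifiedCodebook`, the feature dimensions, and the straight-through quantization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's implementation): a shared vector-quantization
# codebook that maps image and LiDAR features into one embedding space.
import torch
import torch.nn as nn


class UnifiedCodebook(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        # One codebook shared by all modalities (assumed design).
        self.codes = nn.Embedding(num_codes, dim)

    def quantize(self, feats: torch.Tensor):
        # feats: (N, dim) features from either modality, projected to `dim`.
        dists = torch.cdist(feats, self.codes.weight)   # (N, num_codes)
        idx = dists.argmin(dim=-1)                       # nearest code per feature
        quantized = self.codes(idx)                      # (N, dim)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = feats + (quantized - feats).detach()
        return quantized, idx


# Usage: image and LiDAR features are quantized with the *same* codebook, so
# paired features can be encouraged to select the same code indices.
codebook = UnifiedCodebook()
img_feats = torch.randn(1024, 256)   # placeholder image features
pts_feats = torch.randn(1024, 256)   # placeholder LiDAR point features
img_q, img_idx = codebook.quantize(img_feats)
pts_q, pts_idx = codebook.quantize(pts_feats)
```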