This paper proposes CleverDistiller, a self-supervised cross-modal knowledge distillation (KD) framework for transferring generalized features from 2D image-based Vision Foundation Models (VFMs) to 3D LiDAR-based models. Unlike previous approaches that rely on complex loss functions, pseudo-semantic maps, or knowledge transfer tailored to semantic segmentation, CleverDistiller captures complex semantic dependencies through simple yet effective design choices and distills knowledge directly from VFM features without requiring pseudo-semantic maps. In addition, it introduces occupancy prediction as an auxiliary self-supervised spatial task, complementing the semantic knowledge acquired from VFMs with 3D spatial reasoning capabilities. Experiments on autonomous driving benchmarks show that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection, with the gains becoming more pronounced when fine-tuning on limited data.
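To make the two training signals described above concrete, the following is a minimal sketch, not the authors' implementation: it pairs a feature-distillation head with an auxiliary occupancy head on top of shared per-point 3D backbone features. The module names, projection-head architecture, cosine-similarity distillation loss, binary-cross-entropy occupancy loss, and the weighting `lam` are all illustrative assumptions, since the abstract specifies neither the loss forms nor the head designs.

```python
# Hedged sketch of the two objectives the abstract names: direct feature
# distillation from a 2D VFM and auxiliary occupancy prediction. All names,
# shapes, and loss choices here are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillationHead(nn.Module):
    """Projects per-point 3D backbone features into the VFM feature space."""

    def __init__(self, dim_3d: int, dim_vfm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim_3d, dim_vfm), nn.GELU(), nn.Linear(dim_vfm, dim_vfm)
        )

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        return self.proj(feats_3d)


class OccupancyHead(nn.Module):
    """Auxiliary head predicting per-point occupancy logits."""

    def __init__(self, dim_3d: int):
        super().__init__()
        self.cls = nn.Linear(dim_3d, 1)

    def forward(self, feats_3d: torch.Tensor) -> torch.Tensor:
        return self.cls(feats_3d).squeeze(-1)


def training_losses(proj_feats, vfm_feats, occ_logits, occ_target, lam=1.0):
    """Cosine distillation loss plus BCE occupancy loss (assumed weighting)."""
    kd = 1.0 - F.cosine_similarity(proj_feats, vfm_feats, dim=-1).mean()
    occ = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    return kd + lam * occ


# Toy usage with random tensors standing in for 3D point features already
# paired with 2D VFM features via LiDAR-to-camera projection (pairing omitted).
N, D3, DV = 1024, 256, 384
backbone_feats = torch.randn(N, D3)                   # per-point 3D features
vfm_feats = F.normalize(torch.randn(N, DV), dim=-1)   # matched VFM features
occ_target = torch.randint(0, 2, (N,)).float()        # occupied / empty labels

kd_head, occ_head = DistillationHead(D3, DV), OccupancyHead(D3)
loss = training_losses(
    kd_head(backbone_feats), vfm_feats, occ_head(backbone_feats), occ_target
)
loss.backward()
```

The key structural point the sketch illustrates is that both heads consume the same backbone features, so the occupancy gradient shapes the representation that is simultaneously being aligned to the VFM, rather than training a separate branch.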