Existing unsupervised keypoint detection methods rely on artificial transformations, such as masking significant portions of the image, and use reconstruction of the original image as the learning objective. However, these approaches lack depth information and often detect keypoints in the background. To address this issue, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that leverages depth maps and RGB images for keypoint detection in a self-supervised setting. During training, Distill-DKP extracts embedding-level knowledge from a depth-based teacher model to guide an image-based student model; at inference, only the student model is used. Experimental results show that Distill-DKP significantly outperforms existing unsupervised methods, reducing the average L2 error by 47.15% on Human3.6M, the average error by 5.67% on Taichi, and improving keypoint accuracy by 1.3% on DeepFashion. A detailed ablation study demonstrates the sensitivity of knowledge distillation across different layers of the network.
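
To make the embedding-level distillation idea concrete, below is a minimal PyTorch sketch of a depth-based teacher guiding an image-based student by matching their embeddings. The encoders, the `EmbeddingDistiller` class, and the MSE matching loss are illustrative assumptions, not the actual Distill-DKP architecture or objective, which may distill at different layers and with a different loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingDistiller(nn.Module):
    """Hypothetical embedding-level cross-modal distillation (not the paper's exact method)."""

    def __init__(self, student: nn.Module, teacher: nn.Module):
        super().__init__()
        self.student = student  # image-based model, updated during training
        self.teacher = teacher  # depth-based model, frozen: it only provides targets
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    def distillation_loss(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            t_emb = self.teacher(depth)  # teacher embeddings from depth maps
        s_emb = self.student(rgb)        # student embeddings from RGB images
        # Match student embeddings to teacher embeddings; an L2 (MSE) objective
        # is assumed here purely for illustration.
        return F.mse_loss(s_emb, t_emb)


# Toy stand-in encoders: 3-channel RGB for the student, 1-channel depth for the teacher.
student = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
teacher = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())

distiller = EmbeddingDistiller(student, teacher)
rgb = torch.randn(4, 3, 64, 64)
depth = torch.randn(4, 1, 64, 64)
loss = distiller.distillation_loss(rgb, depth)
loss.backward()  # gradients flow only into the student
```

Because the teacher is frozen and the loss is computed only on embeddings, the depth branch can be dropped entirely at inference, which is consistent with the abstract's statement that only the student is used at test time.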