This paper demonstrates that the vision-language alignment of CLIP, pre-trained on web-crawled data, can be transferred to downstream tasks without fine-tuning, even when the task lies outside what the pre-training objective was optimized for. Specifically, we focus on monocular depth estimation and examine why CLIP's contrastive prior knowledge, which transfers well to domains such as generative modeling and semantic segmentation, struggles to generalize to depth. To address CLIP's inability to consistently capture similarities between image patches and natural language prompts describing distance, we distill the semantic prior knowledge of the frozen text encoder into a single trainable embedding matrix, called "mirror," without using pre-trained natural language token embeddings. The primary design goal of mirror is to derive non-human-language prompts that approximate optimal natural language prompts such as "How far is this location from the camera?" With this approach, we jointly train two lightweight modules, mirror and a compressed decoder, on top of frozen CLIP to perform dense depth prediction. The resulting model is significantly more efficient in parameters and computation than existing depth models, performs on par with several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmarks, and outperforms all vision-language depth models built on frozen CLIP prior knowledge. Experimental results demonstrate that CLIP's suboptimal depth understanding, in terms of both spatial and temporal consistency, can be substantially corrected without fine-tuning CLIP or tying mirror to pre-trained subword token embeddings. Furthermore, an ablation and convergence study of mirror shows that it implicitly learns the semantic cues that are crucial for recognizing objects such as people and windows.
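To make the described architecture concrete, the following is a minimal PyTorch sketch of the mirror idea, not the authors' implementation: a trainable embedding matrix is pushed through a frozen text encoder in place of natural-language prompt tokens, patch-prompt similarities are computed against frozen image features, and a small decoder maps them to dense depth. The encoder stubs, module names, shapes, and hyperparameters below are illustrative assumptions.

```python
# Hypothetical sketch of the "mirror" + compressed-decoder setup on frozen CLIP.
# The Linear encoder stubs stand in for CLIP's frozen image/text encoders
# (e.g., an open_clip ViT in practice); all shapes here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MirrorDepth(nn.Module):
    def __init__(self, embed_dim=512, num_prompts=16, prompt_len=8, token_dim=512):
        super().__init__()
        # Stand-ins for CLIP's frozen encoders; kept frozen throughout training.
        self.image_encoder = nn.Linear(embed_dim, embed_dim)   # -> patch features
        self.text_encoder = nn.Linear(token_dim, embed_dim)    # -> prompt features
        for p in list(self.image_encoder.parameters()) + list(self.text_encoder.parameters()):
            p.requires_grad = False

        # "mirror": trainable non-language token embeddings, one sequence per prompt,
        # intended to approximate optimal natural-language depth prompts.
        self.mirror = nn.Parameter(torch.randn(num_prompts, prompt_len, token_dim) * 0.02)

        # Compressed decoder: maps per-patch similarity maps to dense depth.
        self.decoder = nn.Sequential(
            nn.Conv2d(num_prompts, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, patch_tokens):
        # patch_tokens: (B, H*W, embed_dim) patch features from the frozen image encoder.
        B, N, _ = patch_tokens.shape
        H = W = int(N ** 0.5)

        img = F.normalize(self.image_encoder(patch_tokens), dim=-1)   # (B, N, D)
        prompts = self.text_encoder(self.mirror).mean(dim=1)          # (K, D), pooled
        prompts = F.normalize(prompts, dim=-1)

        sim = img @ prompts.t()                                       # (B, N, K)
        sim = sim.permute(0, 2, 1).reshape(B, -1, H, W)               # (B, K, H, W)
        return self.decoder(sim)                                      # (B, 1, H, W) depth


# Only mirror and the decoder receive gradients; the CLIP stand-ins stay frozen.
model = MirrorDepth()
depth = model(torch.randn(2, 196, 512))
print(depth.shape)  # torch.Size([2, 1, 14, 14])
```

In this sketch, only the mirror matrix and the decoder are optimized, which reflects the paper's claim that depth understanding can be corrected without fine-tuning CLIP or tying mirror to pre-trained subword token embeddings.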