To address the challenges of remote sensing image-text retrieval, this paper proposes PriorCLIP, a vision-language model that leverages visual prior information for unbiased representation learning and adaptive image-text alignment. In the closed-domain setting, PriorCLIP employs spatial and temporal Progressive Attention Encoder (PAE) architectures to filter salient features, mitigate semantic bias, and enhance text representations. In the open-domain setting, it adopts a two-stage dictionary representation learning strategy, consisting of large-scale dictionary training on coarse image-text pairs followed by fine-tuning with visual indicators, which enables robust retrieval of long-tail concepts and lexical variations. Furthermore, we propose a cluster-based symmetric contrastive attribution loss that constrains inter-class relationships and mitigates semantic confusion in the shared embedding space. Extensive experiments on the RSICD and RSITMD benchmarks show that PriorCLIP achieves significant gains over existing methods: 4.9% and 4.0% in closed-domain retrieval, and 7.3% and 9.4% in open-domain retrieval, respectively.
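To illustrate the kind of objective the proposed attribution loss extends, the following is a minimal PyTorch sketch of the standard symmetric (bidirectional) image-text contrastive loss used in CLIP-style retrieval; the cluster-based attribution terms specific to PriorCLIP are not reproduced here, and all function and variable names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a symmetric image-text contrastive loss (CLIP-style).
# This is NOT PriorCLIP's cluster-based attribution loss; it only shows the
# bidirectional contrastive objective such losses typically build on.
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of paired images and captions."""
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity logits for every image-text pair in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image),
    # averaged to make the objective symmetric.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```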