Daily Arxiv

This page curates AI-related papers from around the world.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

Created by
  • Haebom

Authors

Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen

Outline

To address the challenges of remote sensing image-text retrieval, this paper proposes PriorCLIP, a vision-language model that uses visual prior information for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP uses two Progressive Attention Encoder (PAE) structures, one spatial and one temporal, to filter salient features, mitigate semantic bias, and enhance text representations. In the open-domain setting, it adopts a two-stage prior representation learning strategy: large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning with vision instructions, which enables robust retrieval across long-tail concepts and vocabulary variations. The paper also proposes a cluster-based symmetric contrastive attribution loss to constrain inter-class relationships and mitigate semantic confusion in the shared embedding space. Extensive experiments on the RSICD and RSITMD benchmarks show that PriorCLIP outperforms existing methods by 4.9% and 4.0% in closed-domain retrieval and by 7.3% and 9.4% in open-domain retrieval.
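
For readers unfamiliar with the term, the sketch below illustrates the generic symmetric contrastive objective used by CLIP-style image-text models, on which losses like the one described above build. It is not the paper's cluster-based attribution loss: the cluster constraint is omitted, and the function names, the temperature of 0.07, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair i (image_embs[i], text_embs[i]) is treated as the only positive;
    every other pairing in the batch is a negative. The temperature value
    is an illustrative assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by temperature: rows = images, cols = texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two image embeddings and two caption embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(images, aligned_texts)
loss_shuffled = symmetric_contrastive_loss(images, shuffled_texts)
```

The loss is small when each image is closest to its own caption and large when the pairs are mismatched, which is the behavior a retrieval model is trained toward.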

Takeaways, Limitations

Takeaways:
Proposes PriorCLIP, a vision-language model that uses visual prior information to improve remote sensing image-text retrieval.
Outperforms existing methods in both closed-domain and open-domain settings.
Introduces the PAE structures, a two-stage prior representation learning strategy, and a cluster-based contrastive loss.
Limitations:
The computational cost and complexity of the proposed model are not analyzed.
Generalization needs to be evaluated on additional remote sensing datasets beyond RSICD and RSITMD.
Further research is needed on its utility and scalability in real-world applications.