Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Created by
  • Haebom

Author

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li

Outline

This paper presents a novel approach for efficient vision-language understanding of large remote sensing images (RSIs). Existing large vision-language models (LVLMs) process images with limited, predefined grids, which causes information loss on gigapixel RSIs. To address this, the authors propose a text-guided token pruning method built on a dynamic image pyramid (DIP). A region-focused module (RFM) leverages text-aware region localization to identify the important visual tokens, and tile selection and token pruning proceed from coarse to fine image tiles based on the RFM output, reducing computation without processing the entire image directly. In addition, to overcome the limitations of existing LVLM evaluation benchmarks, the authors construct a new benchmark, LRS-VQA, containing 7,333 QA pairs across eight categories with image lengths of up to 27,328 pixels. Using the same training data, the proposed method outperforms existing high-resolution strategies on four datasets and is more efficient than existing token reduction methods in high-resolution settings. The source code and dataset are available on GitHub (https://github.com/VisionXLab/LRS-VQA).
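The coarse-to-fine selection described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: text-token relevance is approximated with a plain dot product, and the RFM, DIP construction, and `keep_ratio` parameter are stand-ins for the actual modules.

```python
import numpy as np

def prune_tokens(tile_tokens, text_embedding, keep_ratio=0.25):
    """Keep the fraction of visual tokens most similar to the text query.

    Stand-in for the paper's RFM-guided selection: here relevance is just
    a dot product between each token and the text embedding.
    """
    scores = tile_tokens @ text_embedding          # (n_tokens,)
    k = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.argsort(scores)[-k:]             # indices of top-k tokens
    return tile_tokens[keep_idx], keep_idx

def coarse_to_fine(pyramid, text_embedding, keep_ratio=0.25):
    """Walk a dynamic image pyramid from coarse to fine tiles,
    pruning tokens at every level instead of encoding the full image."""
    kept = []
    for level_tokens in pyramid:                   # coarse -> fine levels
        pruned, _ = prune_tokens(level_tokens, text_embedding, keep_ratio)
        kept.append(pruned)
    return np.concatenate(kept, axis=0)

# Toy pyramid: three levels with 16, 64, and 256 tokens of dimension 64.
rng = np.random.default_rng(0)
text = rng.standard_normal(64)
pyramid = [rng.standard_normal((n, 64)) for n in (16, 64, 256)]
tokens = coarse_to_fine(pyramid, text)
print(tokens.shape)  # (84, 64): far fewer than the 336 tokens in the full pyramid
```

The key design point is that pruning happens per level: coarse tiles are cheap to score, and only the retained regions justify attending to their finer-resolution counterparts.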

Takeaways, Limitations

Takeaways:
Presents a novel method for efficient vision-language understanding of very large remote sensing images.
Reduces computational cost and minimizes information loss through the dynamic image pyramid (DIP) and text-guided token pruning.
Introduces LRS-VQA, a new high-resolution RSI question-answering benchmark that overcomes the limitations of existing benchmarks.
Demonstrates superior performance and efficiency compared to existing high-resolution strategies and token reduction methods.
Limitations:
Further validation of the generality and scalability of the LRS-VQA benchmark is needed.
The generalization of the proposed method to diverse types of very large RSIs remains to be evaluated.
Overall performance may depend heavily on the accuracy of the RFM.