In this paper, we propose AirCache, a novel KV cache compression method for accelerating inference in large vision-language models (LVLMs). LVLMs exhibit strong reasoning and generalization capabilities, but processing numerous visual tokens and generating long-context outputs incurs substantial computational cost and places excessive demands on the KV cache. Through a systematic investigation of the correlations between visual and textual tokens, AirCache identifies significant redundancy among cached visual tokens and strategically evicts the redundant entries, substantially accelerating context generation while preserving model performance. Its key components include elite observation windows for assessing the importance of visual components, robust inter-modal relevance modeling with enhanced multi-view consistency, and an adaptive layer-wise budget allocation strategy that exploits the strength and asymmetry of token importance distributions. Comprehensive evaluations across multiple LVLMs and benchmarks show that AirCache matches the performance of the full cache while retaining only 10% of the visual KV cache, reducing decoding latency by 29% to 66% across a range of batch sizes and prompt lengths. Notably, as the cache retention ratio decreases, its advantage over existing methods widens.
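To make the overall pipeline concrete, the following is a minimal, hedged sketch in PyTorch of the general idea: scoring visual tokens by the attention they receive from a trailing observation window of text queries, allocating per-layer keep budgets from the shape of those score distributions, and evicting the remaining visual KV entries. All function names, tensor shapes, and the budget heuristic (`visual_token_importance`, `allocate_layer_budgets`, `prune_visual_kv`) are illustrative assumptions, not AirCache's exact formulation.

```python
# Illustrative sketch of observation-window-based visual KV cache pruning with
# adaptive per-layer budgets. Shapes, names, and the budget heuristic are
# assumptions for exposition, not the paper's implementation.
import torch

def visual_token_importance(attn, visual_idx, window_len):
    """Score visual tokens by the attention they receive from the last
    `window_len` (observation-window) queries, averaged over heads.

    attn: [num_heads, q_len, kv_len] attention weights for one layer.
    visual_idx: indices of visual tokens along the kv dimension.
    """
    window = attn[:, -window_len:, :]        # queries inside the observation window
    scores = window.mean(dim=(0, 1))         # average over heads and queries -> [kv_len]
    return scores[visual_idx]                # importance of each visual token

def allocate_layer_budgets(layer_scores, total_keep):
    """Assign each layer a share of the total keep budget based on how peaked
    its importance distribution is -- a simple stand-in for an adaptive,
    distribution-aware allocation."""
    sharpness = torch.stack([(s / s.sum()).max() for s in layer_scores])
    weights = 1.0 / (sharpness + 1e-6)       # flatter distributions get larger shares
    weights = weights / weights.sum()
    return (weights * total_keep).round().long().clamp(min=1)

def prune_visual_kv(layer_scores, visual_idx, budgets):
    """Return, per layer, the kv-cache positions of visual tokens to keep."""
    keep = []
    for scores, budget in zip(layer_scores, budgets):
        k = min(int(budget), scores.numel())
        top = scores.topk(k).indices
        keep.append(visual_idx[top])
    return keep

if __name__ == "__main__":
    torch.manual_seed(0)
    num_layers, heads, q_len, kv_len, window_len = 4, 8, 32, 256, 8
    visual_idx = torch.arange(16, 16 + 144)   # assumed span of visual tokens in the cache
    attn = [torch.softmax(torch.randn(heads, q_len, kv_len), dim=-1)
            for _ in range(num_layers)]
    scores = [visual_token_importance(a, visual_idx, window_len) for a in attn]
    budgets = allocate_layer_budgets(scores, total_keep=int(0.10 * 144 * num_layers))
    kept = prune_visual_kv(scores, visual_idx, budgets)
    for layer, idx in enumerate(kept):
        print(f"layer {layer}: keep {idx.numel()} of {visual_idx.numel()} visual tokens")
```

In this toy setup, roughly 10% of the visual KV entries are retained overall, with the per-layer split determined by the shape of each layer's importance distribution; any real system would additionally keep all textual KV entries intact.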