Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Adaptive KV-Cache Compression without Manually Setting Budget

Created by
  • Haebom

Author

Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang

Outline

This paper proposes GVote, an adaptive KV-cache compression technique, to address the growing memory footprint of the KV cache used to accelerate autoregressive decoding in large language models (LLMs). Unlike existing methods that rely on a fixed compression ratio, GVote dynamically determines the optimal cache size by predicting the attention demand of future queries via Monte-Carlo sampling. Experiments on benchmarks including GSM8K, RULER, and LongBench show that GVote reduces memory usage by a factor of two while maintaining accuracy comparable to or higher than existing methods. A minimal illustrative sketch of this idea follows.
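The summary above only describes the mechanism at a high level, so the following is a minimal sketch of the general idea, not the authors' implementation: pseudo "future" queries are drawn by Monte-Carlo sampling, each sampled query votes for the KV entries it attends to, and the union of votes determines the retained cache (so the budget emerges rather than being set by hand). The function name `gvote_keep_indices`, the perturbation-based sampling distribution, and the `top_p` coverage threshold are all illustrative assumptions.

```python
# Hedged sketch of adaptive KV selection via query sampling and voting.
# Not the paper's code; names and the sampling scheme are assumptions.
import torch

def gvote_keep_indices(keys, recent_queries, n_samples=8, top_p=0.9):
    """keys: (seq_len, d); recent_queries: (m, d). Returns indices of KV entries to keep."""
    d = keys.shape[-1]
    # Monte-Carlo sampling: perturb observed queries to approximate future ones
    # (this sampling distribution is an assumption made for illustration).
    idx = torch.randint(0, recent_queries.shape[0], (n_samples,))
    sampled_q = recent_queries[idx] + 0.02 * torch.randn(n_samples, d)

    # Attention weights of each sampled query over the cached keys.
    scores = torch.softmax(sampled_q @ keys.T / d**0.5, dim=-1)  # (n_samples, seq_len)

    voted = set()
    for s in scores:
        # Each sampled query "votes" for the smallest set of entries
        # covering a top_p fraction of its attention mass.
        order = torch.argsort(s, descending=True)
        cum = torch.cumsum(s[order], dim=0)
        k = int(torch.searchsorted(cum, torch.tensor(top_p)).item()) + 1
        voted.update(order[:k].tolist())

    # The retained set (and thus the budget) is the union of all votes.
    return torch.tensor(sorted(voted))

# Usage: keep only the voted entries of the cache.
keys = torch.randn(1024, 64)
queries = torch.randn(16, 64)
keep = gvote_keep_indices(keys, queries)
compressed_keys = keys[keep]
```

The point of the sketch is that no compression ratio appears anywhere; how many entries survive depends on how concentrated the sampled queries' attention turns out to be.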

Takeaways, Limitations

Takeaways:
  • Presents a novel adaptive KV-cache compression technique that can significantly improve the efficiency of LLM inference.
  • Achieves a favorable memory-accuracy trade-off without requiring compression ratios to be set manually.
  • Strong performance verified across multiple benchmarks.
Limitations:
  • The Monte-Carlo sampling-based prediction may increase computational cost.
  • Further research is needed on the generality of the proposed method and its applicability to diverse LLM architectures.
  • More extensive experiments are needed, as the reported results may be limited to specific benchmarks.