Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Created by
  • Haebom

Authors

Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong

Outline

To address the performance degradation and computational complexity of long-context processing, this paper proposes TokenSelect (Dynamic Token-Level KV Cache Selection), a novel training-free method. TokenSelect selectively computes attention over only the critical KV cache tokens, identified through token-level importance measurement. It reduces selection overhead and improves speed with a Selection Cache, motivated by the observation that consecutive queries tend to be similar, and an efficient Paged Dot Product Kernel. Experiments show superior performance over existing methods, with up to a 23.84× speedup in attention computation and up to a 2.28× reduction in end-to-end latency.
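The selection mechanism can be illustrated with a minimal sketch: score each cached token against the current decoding query, attend over only the top-k tokens, and reuse the previous selection while consecutive queries remain similar. The PyTorch code below is an illustrative approximation, not the paper's implementation; the function names (`select_kv_tokens`, `attend_selected`, `SelectionCache`), the head-summed dot-product scoring, and the cosine-similarity threshold are all simplifying assumptions, and the paper's per-head scoring and Paged Dot Product Kernel are not reproduced here.

```python
import torch
import torch.nn.functional as F

def select_kv_tokens(q, k_cache, top_k):
    # Per-token importance: dot product between the query and each cached key,
    # summed over heads (a simplification of the paper's scoring).
    # q: (num_heads, head_dim); k_cache: (num_heads, seq_len, head_dim)
    scores = torch.einsum("hd,hsd->hs", q, k_cache)      # (num_heads, seq_len)
    importance = scores.sum(dim=0)                       # (seq_len,)
    k = min(top_k, importance.shape[0])
    idx = importance.topk(k).indices.sort().values       # keep original order
    return idx

def attend_selected(q, k_cache, v_cache, idx):
    # Attention restricted to the selected KV cache tokens only.
    k_sel = k_cache[:, idx, :]                           # (num_heads, k, head_dim)
    v_sel = v_cache[:, idx, :]
    head_dim = q.shape[-1]
    logits = torch.einsum("hd,hkd->hk", q, k_sel) / head_dim ** 0.5
    weights = F.softmax(logits, dim=-1)
    return torch.einsum("hk,hkd->hd", weights, v_sel)    # (num_heads, head_dim)

class SelectionCache:
    """Reuse the last token selection while consecutive queries stay similar."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold   # assumed similarity cutoff, not from the paper
        self.prev_q = None
        self.prev_idx = None

    def get_or_select(self, q, k_cache, top_k):
        if self.prev_q is not None:
            sim = F.cosine_similarity(q.flatten(), self.prev_q.flatten(), dim=0)
            if sim >= self.threshold:
                return self.prev_idx                     # cache hit: skip re-selection
        self.prev_q = q
        self.prev_idx = select_kv_tokens(q, k_cache, top_k)
        return self.prev_idx

# Toy usage: one decoding step with 4 heads, 64-dim heads, 1024 cached tokens.
num_heads, head_dim, seq_len = 4, 64, 1024
q = torch.randn(num_heads, head_dim)
k_cache = torch.randn(num_heads, seq_len, head_dim)
v_cache = torch.randn(num_heads, seq_len, head_dim)
cache = SelectionCache()
idx = cache.get_or_select(q, k_cache, top_k=128)
out = attend_selected(q, k_cache, v_cache, idx)          # (num_heads, head_dim)
```

Because only the selected tokens enter the softmax, per-step attention cost scales with the selection size rather than the full context length, which is where the reported speedups come from.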

Takeaways, Limitations

Takeaways:
Presents an effective, training-free method that simultaneously improves the speed and accuracy of long-context processing.
Effectively addresses the slowdown of existing long-context processing methods.
Reduces computational cost via token-level importance measurement and selective use of the KV cache.
Limitations:
Performance may be biased toward the specific datasets or models evaluated.
The effectiveness of the Selection Cache and the Paged Dot Product Kernel may vary with dataset or model size.
Further experiments are needed across a wider range of LLMs and application domains.