To address the performance degradation and computational overhead of long-context processing in LLMs, this paper proposes Dynamic Token-Level KV Cache Selection (TokenSelect), a novel training-free method. TokenSelect computes attention only over the most important KV cache tokens, chosen by token-level importance scores. To keep the selection step cheap, it introduces a Selection Cache, motivated by the observation that consecutive queries are highly similar, together with an efficient Paged Dot Product Kernel. Experiments show performance superior to existing methods, with up to 23.84x faster attention computation and up to 2.28x lower end-to-end latency.
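To make the selection mechanism concrete, below is a minimal single-head, single-query sketch in PyTorch. It is not the paper's implementation (which relies on a custom paged kernel and operates per head); the function name, the fixed top_k budget, and the cosine-similarity threshold used to emulate the Selection Cache are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_select_attention(q, k_cache, v_cache, top_k=512, sim_threshold=0.9, state=None):
    """Sketch of token-level KV selection: score every cached token by its
    dot product with the current query, keep only the top-k, and attend over
    that subset. A small "selection cache" reuses the previous selection
    when consecutive queries are similar (hypothetical threshold)."""
    # q: (d,), k_cache / v_cache: (n, d)
    if state is not None and F.cosine_similarity(q, state["q"], dim=0) > sim_threshold:
        idx = state["idx"]                        # reuse cached token selection
    else:
        scores = k_cache @ q                      # per-token importance, shape (n,)
        idx = torch.topk(scores, min(top_k, k_cache.size(0))).indices
        state = {"q": q, "idx": idx}              # refresh the selection cache
    k_sel, v_sel = k_cache[idx], v_cache[idx]     # attend over the selected subset only
    attn = torch.softmax(k_sel @ q / k_sel.size(-1) ** 0.5, dim=0)
    return attn @ v_sel, state                    # output shape (d,)
```

The key cost saving in this sketch is that the softmax and value aggregation touch only top_k tokens instead of the full cache, and the scoring pass itself is skipped entirely whenever the similarity check allows the previous selection to be reused.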
Takeaways, Limitations
• Takeaways:
◦ Presents an effective method that improves both the speed and accuracy of long-context processing without any training.
◦ Effectively resolves the slowdown suffered by existing long-context processing methods.
◦ Reduces computational cost through token-level importance measurement and selective use of the KV cache.
• Limitations:
◦ The reported performance may be biased toward specific datasets or models.
◦ The effectiveness of the Selection Cache and the Paged Dot Product Kernel may vary with dataset or model size.
◦ Further experiments are needed across a wider range of LLMs and application areas.