Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System

Created by
  • Haebom

Authors

Yunhua Fang, Rui Xie, Asad Ul Haq, Linsen Ma, Kaoutar El Maghraoui, Naigang Wang, Meng Wang, Liu Liu, Tong Zhang

Outline

This paper studies dynamic KV cache placement in a heterogeneous memory system that combines high-bandwidth memory (HBM) with high-speed off-package DRAM, in order to address memory bandwidth constraints in large language model (LLM) inference. The authors highlight that, despite the sparsity of the attention mechanism, the relevance of past tokens changes over time, so the entire KV cache must remain accessible. Rather than proposing a specific scheduling policy, the paper formulates the placement problem as a mathematical model and derives a theoretical upper bound, suggesting potential for runtime optimization. The authors present this as the first formal study of dynamic KV cache scheduling for LLM inference on a heterogeneous memory system.
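To make the placement problem concrete, below is a minimal illustrative formulation; it is not the model from the paper, and the symbols x_{i,t}, s_i, B_HBM, B_DRAM, C_HBM, and lambda are assumptions introduced here. Each token's KV block is assigned to either HBM or DRAM at every decode step, subject to HBM capacity, with the objective of minimizing total KV read time plus a migration penalty.

```latex
% Illustrative sketch only: the paper's actual model is not reproduced here.
% x_{i,t} = 1 if the KV block of token i resides in HBM at decode step t, 0 otherwise.
% s_i : size of token i's KV block;  B_HBM, B_DRAM : tier read bandwidths;
% C_HBM : HBM capacity;  lambda : weight on the cost of migrating blocks between tiers.
\[
\begin{aligned}
\min_{x}\quad & \sum_{t}\sum_{i \le t} s_i\!\left(\frac{x_{i,t}}{B_{\mathrm{HBM}}}
                + \frac{1-x_{i,t}}{B_{\mathrm{DRAM}}}\right)
                + \lambda \sum_{t}\sum_{i \le t} s_i\,\bigl|x_{i,t}-x_{i,t-1}\bigr| \\
\text{s.t.}\quad & \sum_{i \le t} s_i\, x_{i,t} \;\le\; C_{\mathrm{HBM}} \quad \forall t,
\qquad x_{i,t} \in \{0,1\}.
\end{aligned}
\]
```

Reading i ≤ t as "all tokens generated so far", the first term is the per-step KV read time across the two tiers and the second term charges for moving blocks between tiers; an oracle that solves such a program with full knowledge of future accesses would yield the kind of theoretical upper bound on achievable speedup that the paper derives.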

Takeaways, Limitations

Takeaways: Provides a theoretical foundation for improving LLM inference performance with heterogeneous memory systems and points to opportunities for runtime optimization of dynamic KV cache placement.
Limitations: The paper presents only a theoretical upper bound and does not propose a concrete scheduling policy; performance evaluation on real systems and the development of concrete optimization algorithms remain open. Generalizability across different LLM architectures and memory system configurations also needs further validation.