Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Created by
  • Haebom

Authors

Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Ting Cai, Zibin Zheng

Overview

This paper proposes Krul, a system for efficient state restoration in multi-turn conversations with large language models (LLMs). Existing KV cache compression methods apply the same compression scheme to every conversation. Krul instead selects a compression strategy dynamically for each conversation, based on attention pattern similarity, because attention patterns differ from one conversation to another. Its key components are predictive compression strategy selection, token-wise heterogeneous attention similarity estimation, and a bubble-free restoration scheduler. In the reported experiments, Krul reduces time to first token (TTFT) by 1.5x to 2.68x and KV cache storage by 1.33x to 2.35x compared with the best-performing existing methods, while maintaining the same generation quality.
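To make the idea of similarity-driven strategy selection concrete, here is a minimal sketch in Python. It is not Krul's algorithm: the function names, the cosine-similarity measure, the 0.9 threshold, and the toy attention vectors are all assumptions chosen for illustration. It only shows how two conversations with different attention patterns could end up with different sets of layers sharing a KV cache, in the spirit of the "dynamic cross-layer KV sharing" in the paper's title.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_shared_layers(attn_by_layer, threshold=0.9):
    """Return layers whose KV cache could be dropped and borrowed from the layer below.

    attn_by_layer: one flattened attention-weight vector per layer (hypothetical input).
    A layer is marked as shared only if the layer below it keeps its own KV cache,
    so every shared layer has a stored neighbour to borrow from.
    """
    shared = []
    for layer in range(1, len(attn_by_layer)):
        prev_kept = (layer - 1) not in shared
        similar = cosine(attn_by_layer[layer - 1], attn_by_layer[layer]) >= threshold
        if prev_kept and similar:
            shared.append(layer)
    return shared

# Toy attention vectors for two conversations (made-up numbers, four layers each).
conv_a = [[0.70, 0.20, 0.10], [0.68, 0.22, 0.10], [0.10, 0.10, 0.80], [0.12, 0.08, 0.80]]
conv_b = [[0.70, 0.20, 0.10], [0.10, 0.80, 0.10], [0.10, 0.10, 0.80], [0.50, 0.40, 0.10]]

print(select_shared_layers(conv_a))  # layers 1 and 3 resemble the layer below them
print(select_shared_layers(conv_b))  # no adjacent layers are similar enough
```

In this toy setup, conversation A could store only half of its layers, while conversation B would keep all of them. A fixed, conversation-independent policy would either waste storage on A or lose information on B.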

Takeaways, Limitations

Takeaways:
Demonstrates that LLM inference efficiency can be significantly improved by a dynamic KV cache compression strategy tailored to the characteristics of each conversation.
Improves the performance and scalability of LLM-based applications by reducing TTFT and KV cache storage.
Presents novel techniques: predictive compression strategy selection, token-wise heterogeneous attention similarity estimation, and a bubble-free restoration scheduler.
Limitations:
Krul's performance improvements are based on experimental results for specific datasets and tasks, and generalizability to other environments requires further research.
Dynamic compression strategy selection may add computational overhead, which may require optimization.
The complexity of the proposed method may make implementation and maintenance difficult.
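As a rough illustration of why a restoration scheduler cares about "bubbles", the following Python sketch splits the tokens of a conversation between a stream that loads stored KV entries and a stream that recomputes them, assuming the two streams run in parallel. This is a generic load-balancing example, not Krul's scheduler: the function name, the per-token costs, and the assumption that any token can go to either stream are all hypothetical.

```python
def balanced_split(num_tokens, load_us_per_token, recompute_us_per_token):
    """Split tokens between a load stream and a recompute stream running in parallel.

    Returns (tokens_to_load, tokens_to_recompute, finish_time_us).
    Restoration finishes when the slower stream finishes, so the best split
    is the one that minimises the larger of the two stream times.
    Exhaustive search is used here for clarity, not speed.
    """
    best = None
    for n_rec in range(num_tokens + 1):
        n_load = num_tokens - n_rec
        finish = max(n_load * load_us_per_token, n_rec * recompute_us_per_token)
        if best is None or finish < best[2]:
            best = (n_load, n_rec, finish)
    return best

# Made-up costs: loading one token's KV takes 300 us, recomputing it takes 100 us.
n_load, n_rec, finish_us = balanced_split(1000, 300, 100)
print(n_load, n_rec, finish_us)
```

With these made-up costs, loading everything would take 300,000 us and recomputing everything 100,000 us, while the balanced split finishes in 75,000 us because neither stream sits idle waiting for the other.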