
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

Created by
  • Haebom

Authors

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse

Outline

This paper focuses on compound AI systems (e.g., agent systems) in which multiple large language models (LLMs), specialized for different users, tasks, and roles, work together. In such systems, different models often process inputs that share the same contextual prefix. While prior work has focused on reusing the prefix KV cache within a single model, reusing a prefix KV cache across different models remains an open problem. The paper presents DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes serving different LLMs. For LLMs that share the same architecture, DroidSpeak improves inference performance without quality loss by selectively recomputing only a few layers of the KV cache produced by another LLM and reusing the remaining layers. It gains further speedups by carefully pipelining the layer-wise recomputation with the loading of the reused KV cache. Across a range of datasets and model pairs, DroidSpeak improves throughput by up to 4x and speeds up prefill by about 3.1x, with negligible loss in F1, ROUGE-L, or code-similarity scores.
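To make the reuse-plus-pipelining idea concrete, here is a minimal Python sketch of the general technique as described above. All names here (`load_kv_layer`, `recompute_kv_layer`, `CRITICAL_LAYERS`, the tensor shapes) are illustrative assumptions, not DroidSpeak's actual API: the receiver model recomputes KV tensors for a small set of layers while the sender model's cached tensors for the remaining layers are loaded in parallel.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of selective cross-LLM KV-cache reuse.
# Layer indices, shapes, and function names are illustrative only.

NUM_LAYERS = 32
CRITICAL_LAYERS = {0, 1, 2, 3}  # layers the receiver recomputes itself


def load_kv_layer(sender_cache, layer):
    """Fetch the sender model's KV tensors for one layer (stubbed)."""
    return sender_cache[layer]


def recompute_kv_layer(prompt_ids, layer):
    """Recompute KV for one layer with the receiver model (stubbed)."""
    return np.zeros((len(prompt_ids), 8))  # placeholder KV tensor


def build_receiver_cache(prompt_ids, sender_cache):
    """Assemble the receiver's KV cache: recompute a few layers,
    reuse the rest, and overlap the loads with the recomputation."""
    cache = {}
    with ThreadPoolExecutor() as pool:
        # Kick off loads of all reused layers in the background.
        loads = {
            layer: pool.submit(load_kv_layer, sender_cache, layer)
            for layer in range(NUM_LAYERS)
            if layer not in CRITICAL_LAYERS
        }
        # Recompute the critical layers while the loads proceed.
        for layer in sorted(CRITICAL_LAYERS):
            cache[layer] = recompute_kv_layer(prompt_ids, layer)
        # Collect the reused layers as their loads complete.
        for layer, fut in loads.items():
            cache[layer] = fut.result()
    return cache


if __name__ == "__main__":
    prompt = list(range(128))
    sender = {l: np.ones((len(prompt), 8)) for l in range(NUM_LAYERS)}
    kv = build_receiver_cache(prompt, sender)
    print(f"reused {NUM_LAYERS - len(CRITICAL_LAYERS)} layers, "
          f"recomputed {len(CRITICAL_LAYERS)}")
```

The `ThreadPoolExecutor` stands in for the paper's pipelining: loading cached tensors is I/O-bound and can overlap with the compute-bound layer recomputations, which is where the additional speedup comes from.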

Takeaways, Limitations

Takeaways:
Shows that KV caches can be reused across different LLMs to improve the performance of distributed LLM inference systems.
DroidSpeak offers an effective way to substantially improve throughput and prefill latency while keeping quality degradation negligible.
Experimentally verifies the feasibility of KV cache reuse between LLMs that share the same architecture.
Limitations:
DroidSpeak applies only to LLMs with the same architecture; KV cache reuse across LLMs with different architectures requires further research.
Experiments were limited to specific datasets and model pairs, so the generalizability of the results needs further validation.
Further work is needed to optimize the layer-wise recomputation strategy and to extend the approach to a wider range of architectures.