Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Created by
  • Haebom

Author

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu

Outline

Mooncake is the serving platform for Kimi, the flagship LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters, and it leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache store. At the heart of Mooncake is a KVCache-centric scheduler that maximizes overall effective throughput while meeting latency-related service-level objectives (SLOs). Unlike existing work that assumes all requests will be processed, Mooncake must cope with highly overloaded scenarios; to mitigate this, the authors developed a prediction-based early rejection policy. Experimental results show that Mooncake excels in long-context scenarios: compared to baseline methods, it increases throughput by up to 525% in certain simulated scenarios while meeting SLOs, and under real-world workloads its architecture enables Kimi to handle 75% more requests.
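
To make the scheduling idea concrete, below is a minimal Python sketch of how a KVCache-centric scheduler with prediction-based early rejection might look. The class names, the 16-token block size, the SLO thresholds, and the linear latency models are all illustrative assumptions for this summary, not details taken from the paper.

```python
# Hypothetical sketch of a KVCache-centric scheduler with prediction-based
# early rejection. All names, thresholds, and latency models are assumptions
# made for illustration, not Mooncake's actual implementation.

from dataclasses import dataclass, field

TTFT_SLO_MS = 1000.0   # assumed time-to-first-token SLO
TBT_SLO_MS = 100.0     # assumed time-between-tokens SLO

@dataclass
class PrefillInstance:
    cached_block_hashes: set = field(default_factory=set)
    queued_prefill_tokens: int = 0

    def estimate_ttft_ms(self, new_tokens: int) -> float:
        # Toy linear model: queued work plus this request's uncached tokens.
        return 0.05 * (self.queued_prefill_tokens + new_tokens)

@dataclass
class DecodeInstance:
    active_requests: int = 0

    def estimate_tbt_ms(self) -> float:
        # Toy model: per-token latency grows with the decoding batch size.
        return 10.0 + 0.5 * self.active_requests

@dataclass
class Request:
    block_hashes: list   # hashes of the prompt's KVCache blocks
    prompt_tokens: int

def schedule(req: Request,
             prefills: list[PrefillInstance],
             decodes: list[DecodeInstance]):
    """Pick the prefill instance with the best KVCache reuse, then reject
    the request early if the predicted TTFT or TBT would violate an SLO."""

    def reused_blocks(p: PrefillInstance) -> int:
        # KVCache-centric placement: count the longest reusable prefix of
        # this request's KVCache blocks already held by the instance.
        n = 0
        for h in req.block_hashes:
            if h not in p.cached_block_hashes:
                break
            n += 1
        return n

    best_prefill = max(prefills, key=reused_blocks)
    uncached_tokens = max(
        0, req.prompt_tokens - reused_blocks(best_prefill) * 16  # assumed 16-token blocks
    )
    best_decode = min(decodes, key=lambda d: d.estimate_tbt_ms())

    # Prediction-based early rejection: drop the request before doing any
    # work if either stage is predicted to miss its SLO under overload.
    if best_prefill.estimate_ttft_ms(uncached_tokens) > TTFT_SLO_MS:
        return None
    if best_decode.estimate_tbt_ms() > TBT_SLO_MS:
        return None

    best_prefill.queued_prefill_tokens += uncached_tokens
    best_decode.active_requests += 1
    return best_prefill, best_decode
```

The sketch only illustrates the two ideas highlighted in the paper's summary: placing requests where KVCache reuse is highest, and rejecting requests early based on predicted latency rather than after resources have been spent.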

Takeaways, Limitations

Takeaways:
  • A KVCache-centric disaggregated architecture can significantly improve the throughput of an LLM serving platform.
  • Underutilized CPU, DRAM, and SSD resources of GPU clusters can be exploited to improve system efficiency.
  • A prediction-based early rejection policy keeps the system stable under overload.
  • The architecture performs particularly well in long-context scenarios.
Limitations:
  • Further analysis is needed of the discrepancy between simulated results and real-world workload results.
  • The accuracy and optimization potential of the prediction-based early rejection policy require further study.
  • Long-term performance and stability evaluation in production environments is still needed.
  • Generalizability to other LLM models and workloads remains to be verified.