Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Created by
  • Haebom

Author

Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan

Outline

This paper presents LoopServe, an adaptive dual-phase framework for accelerating large language model (LLM) inference in multi-turn dialogues. LoopServe manages conversational context efficiently by introducing online sparsification with dynamic importance selection in the prefilling phase and adaptive key-value (KV) compression in the decoding phase. The authors also propose a new benchmark of 11 multi-turn datasets that reflect realistic query positions and conversational dependencies, and show experimentally that LoopServe outperforms existing methods in both performance and speedup.
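The core idea of importance-driven context management can be illustrated with a minimal sketch. Note this is not LoopServe's actual algorithm (the paper's selection and compression criteria are not detailed in this summary); it is a generic, hypothetical example of keeping only the most-attended tokens in a KV cache, with `select_important_tokens` and `keep_ratio` being names invented for illustration:

```python
import numpy as np

def select_important_tokens(attn_scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Rank past tokens by accumulated attention mass and keep the top fraction.

    attn_scores: (num_query_steps, num_cached_tokens) attention weights
                 observed over recent decoding steps.
    Returns the indices of cache entries to retain, in original order.
    """
    importance = attn_scores.sum(axis=0)           # total attention each cached token received
    k = max(1, int(len(importance) * keep_ratio))  # number of entries to keep
    keep = np.argsort(importance)[-k:]             # indices of the most-attended tokens
    return np.sort(keep)                           # preserve positional order

# Toy example: 4 query steps attending over 8 cached tokens.
rng = np.random.default_rng(0)
scores = rng.random((4, 8))
kept = select_important_tokens(scores, keep_ratio=0.5)
print(kept)  # indices of the 4 cache entries that received the most attention
```

A real system would apply such a selection adaptively per turn (rather than with a fixed ratio) and combine it with compression of the retained KV entries, which is the dual-phase behavior the paper describes.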

Takeaways, Limitations

Takeaways:
  • Presents a novel framework that improves inference speed and efficiency for multi-turn conversational LLMs.
  • Implements dynamic context management through online sparsification and adaptive key-value compression.
  • Offers an acceleration methodology that adapts to real-world conversation patterns.
  • Proposes a new benchmark covering diverse multi-turn datasets.
  • Demonstrates superior performance and speedup over existing methods.
Limitations:
  • This summary lacks the specific figures for the speedups and performance gains reported in the paper.
  • No discussion of LoopServe's complexity or implementation difficulty.
  • Dependencies on specific LLM architectures and hardware environments are not specified.
  • Generalizability to other domains and tasks requires further study.