
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Created by
  • Haebom

Authors

Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan

Outline

In this paper, we propose LoopServe, a novel framework for accelerating large language model (LLM) inference in multi-turn dialogues. LoopServe introduces two approaches to address the difficulty existing LLMs have with long contexts in multi-turn conversations. First, it performs online sparsification during the prefilling phase by dynamically selecting the important parts of the attention matrix. Second, during the decoding phase it uses incremental key-value compression to adaptively maintain a relevant, efficient cache based on recently generated tokens. In addition, we present a new benchmark of 11 multi-turn datasets that reflect realistic question positions and conversational dependencies. Experimental results show that LoopServe achieves better efficiency than existing baselines and significantly improves LLM inference speed across a variety of long-context dialogue tasks.
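To make the two phases concrete, below is a minimal PyTorch sketch of the general idea: top-k selection over attention mass during prefill, and relevance-based pruning of the KV cache during decoding. The function names, tensor shapes, scoring rules, and budgets are illustrative assumptions, not the paper's actual algorithm.

```python
import torch

def online_sparsify_prefill(attn_weights: torch.Tensor, budget: float = 0.1) -> torch.Tensor:
    """Prefill phase (sketch): keep only the key positions that receive the
    most attention mass, summed over heads and query positions.

    attn_weights: post-softmax attention, shape (num_heads, q_len, k_len).
    Returns a boolean mask of shape (k_len,) marking positions to keep.
    """
    column_mass = attn_weights.sum(dim=(0, 1))             # (k_len,)
    k = max(1, int(budget * column_mass.numel()))          # fraction of positions kept
    keep = torch.topk(column_mass, k).indices
    mask = torch.zeros_like(column_mass, dtype=torch.bool)
    mask[keep] = True
    return mask

def compress_kv_incrementally(keys: torch.Tensor, values: torch.Tensor,
                              recent_queries: torch.Tensor, budget: int):
    """Decoding phase (sketch): score each cached key by its maximum
    similarity to the most recently generated query vectors and keep the
    top `budget` entries, so the cache adapts as generation proceeds.

    keys, values: (cache_len, head_dim); recent_queries: (window, head_dim).
    """
    relevance = (recent_queries @ keys.T).amax(dim=0)      # (cache_len,)
    k = min(budget, keys.size(0))
    keep = torch.topk(relevance, k).indices.sort().values  # preserve original order
    return keys[keep], values[keep]
```

In a real serving loop, the prefill mask would restrict which key-value entries are written to the cache, and the compression step would be re-run periodically during decoding over a sliding window of recent queries, which is what makes the cache adaptive rather than fixed.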

Takeaways, Limitations

Takeaways:
We present a novel method that effectively improves LLM inference speed in multi-turn conversations.
Its adaptive approach enables more efficient processing than traditional fixed or position-based heuristics.
We provide a new benchmark built from realistic multi-turn conversation datasets.
LoopServe demonstrates superior performance across a variety of long-context dialogue tasks.
Limitations:
Further validation of the generalizability of the proposed benchmark is needed.
LoopServe's performance may depend on specific LLM architectures or datasets.
The computational complexity of the online sparsification and incremental key-value compression processes requires further analysis.