In this paper, we propose LoopServe, a novel framework for accelerating large language model (LLM) inference in multi-round conversations. LoopServe introduces two innovations to address the difficulty existing LLMs face when handling long contexts across multiple conversation rounds. First, during the prefilling phase, it performs online sparsification by dynamically selecting the important parts of the attention matrix. Second, during the decoding phase, it applies incremental key-value compression, adaptively maintaining a relevant and efficient cache based on the most recently generated tokens. In addition, we present a novel benchmark of 11 multi-round datasets that reflect realistic question positions and conversational dependencies. Experimental results show that LoopServe achieves better efficiency than existing baselines and significantly speeds up LLM inference across a variety of long-context conversation tasks.
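The summary does not spell out LoopServe's exact selection rules, so the following PyTorch sketch only illustrates the general shape of the two phases. The function names, the attention-mass `budget` threshold, and the fixed `keep` cache size are hypothetical choices made for illustration, not the paper's actual algorithm.

```python
import torch

def sparsify_prefill_attention(attn_scores: torch.Tensor, budget: float = 0.9) -> torch.Tensor:
    """Prefilling phase (sketch): keep the key positions that together
    capture `budget` of the total attention mass; drop the rest."""
    # Total attention each key position receives, summed over all query rows.
    col_mass = attn_scores.sum(dim=0)                      # shape: (seq_len,)
    order = torch.argsort(col_mass, descending=True)
    cum = torch.cumsum(col_mass[order], dim=0) / col_mass.sum()
    k = int((cum < budget).sum().item()) + 1               # smallest prefix reaching budget
    return order[:k]                                       # indices of keys to keep

def compress_kv_cache(keys, values, recent_queries, keep: int = 256):
    """Decoding phase (sketch): re-rank cached entries by how strongly the
    most recently generated tokens attend to them, and keep the top `keep`."""
    # Relevance of each cached key to the recent-output window.
    scores = (recent_queries @ keys.T).softmax(dim=-1).sum(dim=0)  # (num_cached,)
    idx = torch.topk(scores, min(keep, keys.size(0))).indices
    return keys[idx], values[idx]

# Toy usage: one head, 512 prefilled tokens, 64-dim, last 8 decoded queries.
q = torch.randn(8, 64)
K, V = torch.randn(512, 64), torch.randn(512, 64)
attn = (torch.randn(512, 64) @ K.T).softmax(dim=-1)
kept = sparsify_prefill_attention(attn, budget=0.9)
K, V = K[kept], V[kept]             # prune the cache once after prefilling
K, V = compress_kv_cache(K, V, q)   # periodically re-prune while decoding
```

The key design point the paper claims, under this reading, is that both steps adapt to the observed attention pattern rather than relying on fixed positions (e.g., always keeping the first and last tokens).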
Takeaways and Limitations
• Takeaways:
◦ We present a novel method that effectively improves LLM inference speed in multi-round conversations.
◦ The adaptive approach enables more efficient processing than traditional fixed or position-based heuristics.
◦ We provide a new benchmark built from realistic multi-round conversation datasets.
◦ LoopServe demonstrates superior performance on various long-context conversation tasks.
• Limitations:
◦ The generalizability of the proposed benchmark requires further validation.
◦ LoopServe's performance may depend on specific LLM architectures or datasets.
◦ The complexity of the online sparsification and incremental key-value compression processes needs further analysis.