Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

Created by
  • Haebom

Authors

Youngmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki

Outline

LLMs are widely used worldwide, from everyday tasks to agent systems and data analysis, and they require significant GPU resources. However, LLM inference systems are slow compared to database systems, and their performance and internal mechanisms are often treated as a black box, which limits the adoption of LLMs within databases and other performance-critical applications. This paper analyzes LLM inference performance with a focus on the data management issues that arise inside LLM inference. In particular, it finds that when executing concurrent inference requests, there is no appropriate resource cost model or optimization strategy for scheduling requests whose intermediate results are cached in GPU memory. The authors develop a cost model for concurrent inference requests and a novel cache replacement policy tailored to LLM inference, showing that applying classic database techniques can significantly reduce GPU costs.
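The paper's actual cost model and policy are not reproduced here. As a rough, hypothetical sketch of what a DBMS-style, cost-aware replacement policy for cached intermediate results (KV caches) could look like, the Python snippet below evicts the entry that is cheapest to recompute and has gone longest without being scheduled. All class names, fields, and the scoring formula are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch (not the paper's implementation): a cost-aware eviction
# policy for per-request KV-cache entries, in the spirit of DBMS buffer-pool
# replacement. All names and the scoring heuristic are illustrative assumptions.
from dataclasses import dataclass
import time


@dataclass
class CacheEntry:
    request_id: str
    num_cached_tokens: int                  # size of the cached intermediate result
    last_scheduled: float                   # last time this request was served
    recompute_cost_per_token: float = 1.0   # assumed relative prefill cost


class CostAwareKVCache:
    """Evicts the entry whose (recomputation cost / idle time) ratio is lowest,
    i.e. entries that are cheap to rebuild and unlikely to be reused soon."""

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self.used_tokens = 0
        self.entries: dict[str, CacheEntry] = {}

    def _eviction_score(self, e: CacheEntry, now: float) -> float:
        recompute_cost = e.num_cached_tokens * e.recompute_cost_per_token
        idle = max(now - e.last_scheduled, 1e-6)
        return recompute_cost / idle        # low score => good eviction victim

    def admit(self, entry: CacheEntry) -> None:
        now = time.monotonic()
        # Evict lowest-scoring entries until the new entry fits (or nothing is left).
        while (self.used_tokens + entry.num_cached_tokens > self.capacity_tokens
               and self.entries):
            victim = min(self.entries.values(),
                         key=lambda e: self._eviction_score(e, now))
            self.used_tokens -= victim.num_cached_tokens
            del self.entries[victim.request_id]
        if self.used_tokens + entry.num_cached_tokens <= self.capacity_tokens:
            self.entries[entry.request_id] = entry
            self.used_tokens += entry.num_cached_tokens

    def touch(self, request_id: str) -> bool:
        """Mark a cached request as scheduled again; returns True on a cache hit."""
        entry = self.entries.get(request_id)
        if entry is None:
            return False
        entry.last_scheduled = time.monotonic()
        return True


if __name__ == "__main__":
    cache = CostAwareKVCache(capacity_tokens=8192)
    cache.admit(CacheEntry("req-A", num_cached_tokens=3000, last_scheduled=time.monotonic()))
    cache.admit(CacheEntry("req-B", num_cached_tokens=6000, last_scheduled=time.monotonic()))
    print(sorted(cache.entries), cache.used_tokens)
```

The point of the sketch is only the design choice it illustrates: like a buffer-pool manager, eviction is driven by an explicit cost estimate (here, recomputation cost weighed against recency) rather than by a fixed heuristic such as pure LRU.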

Takeaways, Limitations

Takeaways:
Focuses on data management problems as the lever for improving LLM inference system performance.
Develops a cost model and a cache replacement policy for handling concurrent inference requests.
Suggests that classic database techniques can significantly reduce GPU costs.
Limitations:
Specific experimental results and the magnitude of the performance improvement are not detailed (only inferable from the abstract).
Details on the practical implementation and deployment of the proposed techniques are lacking.
No comparative analysis against other LLM inference systems is provided (only inferable).