This page curates AI-related papers published worldwide. All content here is summarized using Google Gemini and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
Outline
Inference for long-context Large Language Models (LLMs) is increasingly costly because Transformer attention has quadratic computational complexity and linear memory complexity in sequence length. Existing approximate methods, such as key-value (KV) cache eviction, sparse attention, and prompt compression, rely on rough estimates of the importance of tokens or KV pairs. This paper proposes a novel framework that leverages a small draft model to predict these importance scores more accurately. Specifically, it presents two methods: SpecKV, which uses the draft model's output to estimate the importance of each KV pair and thus perform KV cache eviction more effectively, and SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. Theoretical and experimental analyses support the approach and show a strong correlation between the attention patterns of the draft model and the target model. Extensive experiments on long-context benchmarks show that the methods consistently outperform existing baselines while improving memory usage, latency, and throughput. The source code is available at https://github.com/furiosa-ai/draft-based-approx-llm .
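To make the core idea concrete, here is a minimal sketch of draft-guided KV-cache eviction in the spirit of SpecKV. The function name `speckv_keep_indices` and the aggregation scheme (mean over lookahead queries, max over heads) are hypothetical illustrations, not the paper's exact algorithm: the assumption is only that attention mass from a small draft model's lookahead tokens approximates which prompt positions the target model will need.

```python
import numpy as np

def speckv_keep_indices(draft_attn: np.ndarray, budget: int) -> np.ndarray:
    """Select which KV-cache positions to keep, guided by a draft model.

    draft_attn: attention weights from the draft model's lookahead tokens
        to the prompt tokens, shape (num_heads, num_lookahead, prompt_len).
    budget: number of KV pairs to retain.

    Hypothetical sketch: aggregate attention mass per prompt position
    (mean over lookahead queries, max over heads) as an importance score,
    then keep the top-`budget` positions.
    """
    # Importance per prompt token: mean over lookahead queries, max over heads.
    importance = draft_attn.mean(axis=1).max(axis=0)  # shape (prompt_len,)
    keep = np.argsort(importance)[-budget:]
    return np.sort(keep)  # preserve original token order

# Toy example: 2 heads, 3 lookahead queries, 8 prompt tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 3, 8))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize rows like softmax output
kept = speckv_keep_indices(attn, budget=4)
print(kept)  # indices of the 4 highest-scoring KV pairs, in order
```

In a real pipeline, the target model would then evict all KV pairs outside `kept` before decoding; the same per-token importance scores could drive prompt-token dropping in a SpecPC-style method.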
• We present a novel framework that significantly improves the efficiency of long-context LLM inference by leveraging small draft models.
◦ SpecKV and SpecPC achieve higher accuracy than existing approximation methods while also improving memory usage, latency, and throughput.
◦ The method's validity is supported by an analysis of the correlation between the attention patterns of the draft model and the target model.
◦ The open-source code ensures reproducibility and can support follow-up research.
• Limitations:
◦ The final model's performance may depend on the quality of the draft model; further research is needed on draft-model design and training.
◦ The effectiveness of the proposed methods may vary across datasets and models; additional experiments on a wider range of both are needed.
◦ Running the draft model adds computational overhead, and research is needed to minimize it.