Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Retrospective Sparse Attention for Efficient Long-Context Generation

Created by
  • Haebom

Authors

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

Outline

This paper proposes RetroAttention, a novel KV cache update technique that addresses the slowdown in inference of large language models (LLMs) on long-context tasks (e.g., reasoning, code generation, and multi-turn dialogue). Unlike existing KV cache compression methods, which focus primarily on the input context, RetroAttention tackles accumulated attention errors by updating past attention outputs with newly arrived KV entries during subsequent decoding steps. By maintaining a lightweight output cache, past queries can efficiently access more relevant context while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows earlier approximations to be continually refined. Extensive experiments on long-context generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6x and accuracy by up to 21.9%.
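
As a rough illustration of the idea, the sketch below shows how a cached attention output for a past query could be merged with newly arrived KV entries via an online-softmax update, so the old output is refined rather than recomputed from scratch. This is a minimal sketch of one plausible reading of the mechanism, not the authors' implementation; the names (`OutputCacheEntry`, `attend`, `retro_update`) and the NumPy setting are assumptions.

```python
# Hypothetical sketch of retrospectively updating a cached attention output
# when new KV entries arrive. Not the paper's code; names and structure are
# illustrative assumptions.
import numpy as np

class OutputCacheEntry:
    """Cached attention state for one past query."""
    def __init__(self, out, lse):
        self.out = out  # normalized weighted value sum so far, shape (d_v,)
        self.lse = lse  # log-sum-exp of attention logits seen so far (scalar)

def attend(q, K, V, scale):
    """Scaled dot-product attention over one KV block.
    Returns the normalized output and the block's log-sum-exp."""
    logits = (K @ q) * scale                  # (n,)
    lse = np.logaddexp.reduce(logits)         # scalar normalizer in log space
    weights = np.exp(logits - lse)            # (n,)
    return weights @ V, lse                   # (d_v,), scalar

def retro_update(entry, q, K_new, V_new, scale):
    """Merge newly arrived KV entries into a cached attention output.
    Mathematically equivalent to recomputing attention over the union of
    old and new keys, but only the new block is touched (online softmax)."""
    out_new, lse_new = attend(q, K_new, V_new, scale)
    lse = np.logaddexp(entry.lse, lse_new)
    w_old = np.exp(entry.lse - lse)           # weight of previously seen KV
    w_new = np.exp(lse_new - lse)             # weight of the new KV block
    entry.out = w_old * entry.out + w_new * out_new
    entry.lse = lse
    return entry
```

The sketch only shows the per-query merge arithmetic; per the paper's description, such updates would be applied during subsequent decoding passes on top of a KV compression scheme, with the output cache kept lightweight to limit latency overhead.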

Takeaways and Limitations

Takeaways:
  • Presents an effective method for addressing slow inference in LLMs on long-context tasks.
  • Overcomes the limitations of existing KV cache compression methods, improving accuracy and efficiency simultaneously.
  • Significantly improves LLM performance by increasing effective KV exposure and boosting accuracy.
  • Offers a novel approach that moves beyond the fixed-attention-output paradigm.
Limitations:
  • Lacks specific details on the size and management strategy of RetroAttention's lightweight output cache.
  • Generalizability across different LLM architectures and tasks requires further study.
  • Performance and scalability in real-world deployment environments remain to be evaluated.