Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

REFRAG: Rethinking RAG based Decoding

Created by
  • Haebom

Authors

Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

REFRAG: An Efficient Decoding Framework for Retrieval-Augmented Generation

Outline

Large language models (LLMs) have demonstrated a remarkable ability to leverage external knowledge to improve responses in retrieval-augmented generation (RAG) and in multi-turn and agentic applications. However, processing long context inputs increases system latency and demands significant key-value (KV) cache memory, reducing throughput and creating a fundamental trade-off between knowledge enrichment and system efficiency. The authors observe that in RAG, much of the LLM context consists of passages concatenated from retrieval, of which only a small portion is directly relevant to the query. Because of diversity promotion or deduplication during reranking, these passages often have low semantic similarity to one another, producing block-diagonal attention patterns that differ from those of standard LLM generation tasks. Based on this, the authors argue that most computation over the RAG context during decoding is unnecessary and can be eliminated with minimal impact on performance.

To this end, they propose REFRAG, an efficient decoding framework that compresses, senses, and expands the context to reduce latency in RAG applications. By exploiting this sparsity structure, REFRAG accelerates time-to-first-token (TTFT) by 30.85x (a 3.75x improvement over prior work) with no loss in perplexity. Through its optimization framework for large contexts, REFRAG also extends the LLM's effective context size by 16x. The authors rigorously validate REFRAG across diverse datasets on a variety of long-context tasks, including RAG, multi-turn conversation, and long-document summarization. The experiments show that REFRAG delivers significant speedups over LLaMA models and other state-of-the-art baselines without loss of accuracy across a range of context sizes.
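To make the compress-sense-expand loop above concrete, here is a minimal sketch in PyTorch style. It is an illustration under stated assumptions, not the authors' implementation: `encoder`, `policy`, and `decoder` are hypothetical stand-ins for REFRAG's chunk encoder, its lightweight chunk-selection policy, and the backbone LLM, and the real system trains these components jointly rather than using them off the shelf.

```python
# Minimal sketch of REFRAG-style decoding (illustrative, not the paper's code).
# encoder, policy, and decoder are hypothetical modules: a chunk encoder,
# a lightweight chunk-selection policy, and the backbone LLM decoder.
import torch

def refrag_decode(query_tokens, chunks, encoder, policy, decoder, expand_budget=4):
    # 1) Compress: one embedding per retrieved chunk instead of one per token.
    chunk_embs = torch.stack([encoder(c) for c in chunks])            # (K, d)

    # 2) Sense: score each chunk's relevance to the query and pick the
    #    few worth expanding back into full token sequences.
    scores = policy(query_tokens, chunk_embs)                         # (K,)
    k = min(expand_budget, len(chunks))
    expand = set(torch.topk(scores, k=k).indices.tolist())

    # 3) Expand: selected chunks keep full token resolution; the rest
    #    occupy a single embedding slot each, shrinking the sequence
    #    (and hence the KV cache and prefill attention cost).
    parts = [decoder.embed_tokens(c) if i in expand else chunk_embs[i].unsqueeze(0)
             for i, c in enumerate(chunks)]
    parts.append(decoder.embed_tokens(query_tokens))
    inputs = torch.cat(parts, dim=0).unsqueeze(0)                     # (1, n', d)
    return decoder.generate(inputs_embeds=inputs)
```

The point the sketch captures is that most retrieved chunks never re-enter the decoder as tokens at all, which is where both the TTFT and KV-cache savings come from.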

Takeaways and Limitations

Takeaways:
Proposes REFRAG, an efficient decoding framework that reduces the latency of RAG applications.
Significantly accelerates time-to-first-token (TTFT) by leveraging the sparsity structure of RAG contexts (see the back-of-envelope note after this list).
Extends the usable LLM context size by up to 16x.
Demonstrates speedups without loss of accuracy compared to LLaMA models and other state-of-the-art baselines.
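
As rough intuition for where the TTFT gain comes from (our back-of-envelope arithmetic, not a figure from the paper): prefill attention cost grows roughly quadratically with sequence length, so replacing most chunk tokens with a single embedding per chunk shrinks that term sharply.

```python
# Back-of-envelope estimate, not the paper's measurement. The compression
# rate k = 16 is an assumption chosen to match the reported 16x context
# extension; n is an arbitrary example context length.
n, k = 4096, 16
full_prefill = n ** 2                     # ~attention cost before first token
compressed_prefill = (n // k) ** 2        # every chunk reduced to 1 embedding
print(full_prefill / compressed_prefill)  # 256.0 (a loose upper bound)
# Real speedups (e.g., the reported 30.85x TTFT) are smaller because some
# chunks are expanded back to full tokens and non-attention work remains.
```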
Limitations:
The paper itself does not explicitly discuss its limitations. (However, since REFRAG is an optimization tailored to RAG-style contexts, its applicability to general LLM tasks may be limited.)