Large language models (LLMs) have demonstrated a remarkable ability to leverage external knowledge to improve responses in multi-turn and agentic applications such as retrieval-augmented generation (RAG). However, processing long-context inputs increases system latency and demands significant memory for the key-value (KV) cache, reducing throughput and creating a fundamental trade-off between knowledge enrichment and system efficiency. The authors observe that a large portion of the LLM context in RAG consists of passages concatenated from retrieval, of which only a small fraction is directly relevant to the query. Because of diversity promotion or deduplication during reranking, these passages often exhibit low semantic similarity to one another, producing block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, they argue that most computation over the RAG context during decoding is unnecessary and can be eliminated with minimal impact on performance. To this end, they propose REFRAG, an efficient decoding framework that compresses, senses, and expands the context to reduce the latency of RAG applications. By exploiting this sparsity structure, REFRAG accelerates time-to-first-token (TTFT) by 30.85x (a 3.75x improvement over prior work) with no loss in perplexity. Furthermore, its optimization framework for large contexts allows REFRAG to extend the effective context size of LLMs by 16x. The authors rigorously validate REFRAG on a range of long-context tasks, including RAG, multi-turn conversation, and long-document summarization, across diverse datasets. Experimental results show that REFRAG delivers substantial speedups over LLaMA models and other state-of-the-art baselines without loss in accuracy across a variety of context sizes.
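The block-diagonal sparsity argument can be made concrete with a small illustration. The sketch below (not the authors' implementation; chunk and query lengths are hypothetical) builds an attention mask in which each retrieved passage attends only within itself while the query attends to everything, and then reports what fraction of a dense attention matrix is actually active under that assumption.

```python
# Illustrative sketch of the block-diagonal attention pattern described above.
# This is NOT the REFRAG implementation; chunk_lengths and query_length are
# hypothetical values chosen only to show how sparse the attention becomes.
import numpy as np


def block_diagonal_mask(chunk_lengths, query_length):
    """Boolean mask where retrieved chunks attend only within themselves,
    while query tokens attend to the full concatenated context."""
    total = sum(chunk_lengths) + query_length
    mask = np.zeros((total, total), dtype=bool)

    # Each retrieved passage forms its own self-contained attention block.
    start = 0
    for length in chunk_lengths:
        mask[start:start + length, start:start + length] = True
        start += length

    # Query tokens (appended after the passages) attend to everything.
    mask[start:, :] = True
    return mask


if __name__ == "__main__":
    mask = block_diagonal_mask(chunk_lengths=[4, 4, 4], query_length=3)
    active = int(mask.sum())
    dense = mask.size
    print(f"active attention entries: {active}/{dense} "
          f"({100 * active / dense:.1f}% of a dense matrix)")
```

Under this toy configuration, well under half of the dense attention entries are used; with more and longer retrieved passages the active fraction shrinks further, which is the intuition behind skipping most cross-passage computation during decoding.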