Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

REFRAG: Rethinking RAG based Decoding

Created by
  • Haebom

Authors

Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan

REFRAG: An Efficient Decoding Framework for Retrieval-Augmented Generation

Outline

Large language models (LLMs) have demonstrated a remarkable ability to leverage external knowledge to improve responses in retrieval-augmented generation (RAG) and in multi-turn and agentic applications. However, processing long context inputs increases system latency and demands significant key-value (KV) cache memory, reducing throughput and creating a fundamental trade-off between knowledge enrichment and system efficiency. The authors observe that in RAG, much of the LLM context consists of passages concatenated from retrieval, of which only a small portion is directly relevant to the query. Because of diversity promotion or deduplication during reranking, these passages often have low semantic similarity to one another, producing block-diagonal attention patterns that differ from those of standard LLM generation tasks. Based on this, the authors argue that most computation over the RAG context during decoding is unnecessary and can be eliminated with minimal impact on performance.

To this end, they propose REFRAG, an efficient decoding framework that compresses, senses, and expands the context to reduce latency in RAG applications. By exploiting this sparsity structure, REFRAG accelerates time-to-first-token (TTFT) by 30.85x (a 3.75x improvement over prior work) with no loss in perplexity. Through its optimization framework for large contexts, REFRAG also extends the LLM's effective context size by 16x. The authors rigorously validate REFRAG across diverse datasets on a variety of long-context tasks, including RAG, multi-turn conversation, and long-document summarization. The experiments show that REFRAG delivers significant speedups over LLaMA models and other state-of-the-art baselines without loss of accuracy across a range of context sizes.
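To make the compress-sense-expand loop above concrete, here is a minimal sketch in PyTorch style. It is an illustration under stated assumptions, not the authors' implementation: `encoder`, `policy`, and `decoder` are hypothetical stand-ins for REFRAG's chunk encoder, its lightweight chunk-selection policy, and the backbone LLM, and the real system trains these components jointly rather than using them off the shelf.

```python
# Minimal sketch of REFRAG-style decoding (illustrative, not the paper's code).
# encoder, policy, and decoder are hypothetical modules: a chunk encoder,
# a lightweight chunk-selection policy, and the backbone LLM decoder.
import torch

def refrag_decode(query_tokens, chunks, encoder, policy, decoder, expand_budget=4):
    # 1) Compress: one embedding per retrieved chunk instead of one per token.
    chunk_embs = torch.stack([encoder(c) for c in chunks])            # (K, d)

    # 2) Sense: score each chunk's relevance to the query and pick the
    #    few worth expanding back into full token sequences.
    scores = policy(query_tokens, chunk_embs)                         # (K,)
    k = min(expand_budget, len(chunks))
    expand = set(torch.topk(scores, k=k).indices.tolist())

    # 3) Expand: selected chunks keep full token resolution; the rest
    #    occupy a single embedding slot each, shrinking the sequence
    #    (and hence the KV cache and prefill attention cost).
    parts = [decoder.embed_tokens(c) if i in expand else chunk_embs[i].unsqueeze(0)
             for i, c in enumerate(chunks)]
    parts.append(decoder.embed_tokens(query_tokens))
    inputs = torch.cat(parts, dim=0).unsqueeze(0)                     # (1, n', d)
    return decoder.generate(inputs_embeds=inputs)
```

The point the sketch captures is that most retrieved chunks never re-enter the decoder as tokens at all, which is where both the TTFT and KV-cache savings come from.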

Takeaways and Limitations

Takeaways:
Proposes REFRAG, an efficient decoding framework that reduces the latency of RAG applications.
Significantly accelerates time-to-first-token (TTFT) by leveraging the sparsity structure of RAG contexts (see the back-of-envelope note after this list).
Extends the usable LLM context size by up to 16x.
Demonstrates speedups without loss of accuracy compared to LLaMA models and other state-of-the-art baselines.
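
As rough intuition for where the TTFT gain comes from (our back-of-envelope arithmetic, not a figure from the paper): prefill attention cost grows roughly quadratically with sequence length, so replacing most chunk tokens with a single embedding per chunk shrinks that term sharply.

```python
# Back-of-envelope estimate, not the paper's measurement. The compression
# rate k = 16 is an assumption chosen to match the reported 16x context
# extension; n is an arbitrary example context length.
n, k = 4096, 16
full_prefill = n ** 2                     # ~attention cost before first token
compressed_prefill = (n // k) ** 2        # every chunk reduced to 1 embedding
print(full_prefill / compressed_prefill)  # 256.0 (a loose upper bound)
# Real speedups (e.g., the reported 30.85x TTFT) are smaller because some
# chunks are expanded back to full tokens and non-attention work remains.
```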
Limitations:
The paper itself does not explicitly discuss its limitations. (However, since REFRAG is an optimization tailored to RAG-style contexts, its applicability to general LLM tasks may be limited.)