Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Created by
  • Haebom

Author

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Outline

This paper aims to improve performance on knowledge-intensive visual question answering (VQA) with multimodal large language models (MLLMs). To overcome the limitations of conventional single-pass retrieval-augmented generation (RAG), the authors propose a Multimodal Iterative RAG framework (MI-RAG) that leverages reasoning to improve retrieval and integrates knowledge synthesis. In each iteration, MI-RAG generates multiple queries, retrieves diverse knowledge, and synthesizes it into an updated knowledge record that deepens the model's understanding. Experiments on the Encyclopedic VQA, InfoSeek, and OK-VQA benchmarks show that MI-RAG significantly improves both retrieval and answer accuracy.
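To make the iterative loop concrete, here is a minimal Python sketch of how such a reason-query-retrieve-synthesize cycle could be structured. Everything in it is an assumption inferred from the summary above, not the authors' actual implementation: the `MLLM` and `Retriever` interfaces, the method names (`generate_queries`, `synthesize`, `answer`), and the `max_rounds` budget are all hypothetical.

```python
from typing import Optional, Protocol


class MLLM(Protocol):
    """Assumed interface for a multimodal LLM; not the authors' API."""
    def generate_queries(self, image, question: str, knowledge: list[str]) -> list[str]: ...
    def synthesize(self, image, question: str, knowledge: list[str]) -> list[str]: ...
    def answer(self, image, question: str, knowledge: list[str]) -> Optional[str]: ...


class Retriever(Protocol):
    """Assumed interface for querying an external knowledge base."""
    def search(self, query: str, top_k: int = 5) -> list[str]: ...


def mi_rag_loop(image, question: str, mllm: MLLM, retriever: Retriever,
                max_rounds: int = 3) -> Optional[str]:
    """One plausible reading of an MI-RAG-style loop: reason, query,
    retrieve, synthesize, repeated until an answer emerges or the
    round budget (an assumed hyperparameter) is exhausted."""
    knowledge: list[str] = []
    answer: Optional[str] = None
    for _ in range(max_rounds):
        # Reasoning: propose multiple queries conditioned on the image,
        # the question, and the knowledge accumulated so far.
        queries = mllm.generate_queries(image, question, knowledge)

        # Retrieval: gather candidate passages for every query.
        for q in queries:
            knowledge.extend(retriever.search(q, top_k=5))

        # Synthesis: consolidate passages into a refined knowledge record.
        knowledge = mllm.synthesize(image, question, knowledge)

        # Attempt an answer; None signals that another round is needed.
        answer = mllm.answer(image, question, knowledge)
        if answer is not None:
            break
    return answer
```

The key design point the summary emphasizes is that each round's queries are conditioned on the knowledge synthesized in earlier rounds, which is what distinguishes this loop from single-pass RAG.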

Takeaways, Limitations

Takeaways:
Proposes a novel approach (the MI-RAG framework) to knowledge-intensive VQA.
Improves model comprehension through iterative reasoning and knowledge synthesis.
Demonstrates improved performance over existing models on multiple benchmarks.
Provides a scalable framework for knowledge-intensive VQA.
Limitations:
Further detail on the framework's concrete implementation and computational cost is needed.
Further research is needed on the generalizability of MI-RAG and its applicability to other multimodal problems.
Specifics on knowledge-base selection and management strategies are absent.