This page curates AI-related papers published worldwide. All content is summarized by Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
This paper introduces VIPER, a novel multimodal framework for visually guided planning that integrates Vision-Language Model (VLM)-based perception with Large Language Model (LLM)-based reasoning. It uses a modular pipeline in which a frozen VLM generates textual descriptions of image observations, which an LLM policy then combines with the task objective to predict actions. The reasoning module is fine-tuned with behavior cloning and reinforcement learning to strengthen the agent's decision-making. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visually guided planners and narrows the gap with purely text-based oracles. By using text as an intermediate representation, VIPER improves explainability and enables fine-grained analysis of its perception and reasoning components.
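Below is a minimal Python sketch of the perception-to-reasoning loop described above. The helper names (`describe_observation`, `propose_action`) and the environment interface are illustrative assumptions, not the paper's actual API; the sketch only shows how text serves as the intermediate representation between the frozen VLM and the LLM policy.

```python
# Minimal sketch of a VIPER-style pipeline: a frozen VLM turns image
# observations into text, and an LLM policy maps (goal, history, scene text)
# to the next action. All names and the env interface are illustrative.

def describe_observation(image) -> str:
    """Perception step: a frozen VLM would caption the image observation."""
    # Placeholder output standing in for a real VLM call.
    return "You are facing a countertop with a mug and a coffee machine."

def propose_action(goal: str, history: list[str], scene_text: str) -> str:
    """Reasoning step: an LLM policy would predict the next action from text."""
    prompt = (
        f"Goal: {goal}\n"
        f"Previous actions: {history}\n"
        f"Current observation: {scene_text}\n"
        f"Next action:"
    )
    # Placeholder output standing in for a real LLM call on `prompt`.
    return "take mug 1 from countertop 1"

def run_episode(env, goal: str, max_steps: int = 30) -> list[str]:
    """Roll out one episode: perceive, reason, act, repeat."""
    history: list[str] = []
    obs = env.reset()                                   # assumed ALFWorld-like env API
    for _ in range(max_steps):
        scene_text = describe_observation(obs)          # VLM: image -> text
        action = propose_action(goal, history, scene_text)  # LLM: text -> action
        obs, done = env.step(action)                    # assumed to return (next_obs, done)
        history.append(action)
        if done:
            break
    return history
```

In this setup, behavior cloning and reinforcement learning would fine-tune only the LLM policy (the `propose_action` role here), while the VLM remains frozen.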
Takeaways, Limitations
• Takeaways:
◦ Presents a novel framework that effectively addresses visually guided planning by integrating a VLM with an LLM.
◦ Uses text as an intermediate representation, improving model explainability and enabling separate analysis of the perception and reasoning components.
◦ Outperforms previous state-of-the-art models on the ALFWorld benchmark.
◦ Improves agent decision-making through behavior cloning and reinforcement learning.
• Limitations:
◦ Evaluation relies on the ALFWorld benchmark, so generalization to other environments requires further verification.
◦ Potential performance degradation and efficiency issues arising from integrating a VLM with an LLM need further study.
◦ A performance gap remains relative to purely text-based oracles.