Daily Arxiv

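This page collects artificial intelligence papers published around the world.
Summaries on this page are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.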

Compose Yourself: Average-Velocity Flow Matching for One-Step Speech Enhancement

TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning

Exploring How Audio Effects Alter Emotion with Foundation Models

Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs

OnlineMate: An LLM-Based Multi-Agent Companion System for Cognitive Support in Online Learning

DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion

SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models

Fresh in memory: Training-order recency is linearly encoded in language model activations

Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting

Machines are more productive than humans until they aren't, and vice versa

Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

Pun Unintended: LLMs and the Illusion of Humor Understanding

SpecVLM: Fast Speculative Decoding in Vision-Language Models

MALLM: Multi-Agent Large Language Models Framework

Large Language Models for Security Operations Centers: A Comprehensive Survey

Quality Assessment of Tabular Data using Large Language Models and Code Generation

Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis

VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

Stated Preference for Interaction and Continued Engagement (SPICE): Evaluating an LLM's Willingness to Re-engage in Conversation

Statistical Inference for Misspecified Contextual Bandits

TinyDef-DETR: A DETR-based Framework for Defect Detection in Transmission Lines from UAV Imagery

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

PDTrim: Targeted Pruning for Prefill-Decode Disaggregation in Inference

DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases

Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

A Dynamic Fusion Model for Consistent Crisis Response

Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

Revealing Hidden Precursors to Earthquakes via a Stress-Sensitive Transformation of Seismic Noise

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

(DEMO) Deep Reinforcement Learning Based Resource Allocation in Distributed IoT Systems

GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning

Agentic AI for Software: thoughts from Software Engineering community

An Efficient Dual-Line Decoder Network with Multi-Scale Convolutional Attention for Multi-organ Segmentation

Retrieval Enhanced Feedback via In-context Neural Error-book

Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs

IPGPhormer: Interpretable Pathology Graph-Transformer for Survival Analysis

Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Steering Towards Fairness: Mitigating Political Bias in LLMs

Advancing Knowledge Tracing by Exploring Follow-up Performance Trends

DETACH: Cross-domain Learning for Long-Horizon Tasks via Mixture of Disentangled Experts

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

Tensor-Empowered Asset Pricing with Missing Data

BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability

Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

HOTA: Hamiltonian framework for Optimal Transport Advection

Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

Search-Optimized Quantization in Biomedical Ontology Alignment

Loss-Complexity Landscape and Model Structure Functions

Latent Policy Steering with Embodiment-Agnostic Pretrained World Models

Automating Steering for Safe Multimodal Large Language Models

Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility

Interpretability-Aware Pruning for Efficient Medical Image Analysis

Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Resolving Turbulent Magnetohydrodynamics: A Hybrid Operator-Diffusion Framework

Quantifying Student Success with Generative AI: A Monte Carlo Simulation Informed by Systematic Review

Enhancing Live Broadcast Engagement: A Multi-modal Approach to Short Video Recommendations Using MMGCN and User Preferences

Multi-View Contrastive Learning for Robust Domain Adaptation in Medical Time Series Analysis

"What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding

Progressive Size-Adaptive Federated Learning: A Comprehensive Framework for Heterogeneous Multi-Modal Data Systems

ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System

Beyond Autocomplete: Designing CopilotLens Towards Transparent and Explainable AI Coding Agents

SUA: Stealthy Multimodal Large Language Model Unlearning Attack

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

See What I Mean? CUE: A Cognitive Model of Understanding Explanations

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Learning to Align: Addressing Character Frequency Distribution Shifts in Handwritten Text Recognition

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Speech Recognition on TV Series with Video-guided Post-ASR Correction

Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games

Survey on the Evaluation of Generative Models in Music

Were Residual Penalty and Neural Operators All We Needed for Solving Optimal Control Problems?

Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization

ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge

Diffusion Graph Neural Networks and Dataset for Robust Olfactory Navigation in Hazard Robotics

Cross-Attention Speculative Decoding

From Chat Logs to Collective Insights: Aggregative Question Answering

Less is More: Unlocking Specialization of Time Series Foundation Models via Structured Pruning

Fluent but Foreign: Even Regional LLMs Lack Cultural Alignment

How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

Does quantization affect models' performance on long-context tasks?

We Need to Measure Data Diversity in NLP -- Better and Broader

Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

MaskedManipulator: Versatile Whole-Body Manipulation

How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark

Runaway is Ashamed, But Helpful: On the Early-Exit Behavior of Large Language Model-based Agents in Embodied Environments

DisastIR: A Comprehensive Information Retrieval Benchmark for Disaster Management

A Risk Ontology for Evaluating AI-Powered Psychotherapy Virtual Agents

The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support

EmoGist: Efficient In-Context Learning for Visual Emotion Understanding

Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

Created by

Haebom

Authors

Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang, Derek F. Wong

Overview

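This paper proposes Spec-VLA, which applies a Speculative Decoding (SD) framework to reduce the computational cost of Vision-Language-Action (VLA) models. Existing VLA models are computationally expensive because of the large parameter count of the underlying Visual Language Model (VLM) and its autoregressive (AR) decoding. To overcome the limits imposed by the difficulty of the action prediction task and the greedy decoding mechanism of VLA models, Spec-VLA introduces an acceptance relaxation mechanism based on the relative distance between action tokens. In experiments, Spec-VLA achieves a 1.42x speedup over the OpenVLA baseline without lowering the success rate, and increases the acceptance length by 44%. These results point to the potential of speculative execution in VLA prediction scenarios.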
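The idea of relaxed acceptance under greedy decoding can be illustrated with a small sketch. This is not the authors' implementation: the function name `verify_draft`, the `relax_threshold` parameter, and the treatment of action token ids as ordered bin indices (so that the absolute difference of two ids serves as their "relative distance") are all assumptions made for illustration. With a threshold of 0 the check reduces to the usual strict greedy match.

```python
from typing import List, Tuple


def verify_draft(
    draft_tokens: List[int],
    target_tokens: List[int],
    relax_threshold: int = 0,
) -> Tuple[List[int], int]:
    """Hypothetical verification step for speculative decoding with greedy decoding.

    draft_tokens:    action tokens proposed by the small draft model.
    target_tokens:   greedy predictions of the large verifier at the same
                     positions, optionally followed by one extra "bonus" token.
    relax_threshold: assumed maximum bin distance at which a draft token is
                     still accepted (0 means strict exact-match acceptance).

    Returns the tokens to emit and the number of draft tokens accepted.
    """
    output: List[int] = []
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if abs(draft - target) <= relax_threshold:
            # Draft token is close enough to the verifier's choice: keep it.
            output.append(draft)
            accepted += 1
        else:
            # First rejection: emit the verifier's token and stop.
            output.append(target)
            return output, accepted
    # Every draft token was accepted; emit the bonus token if one is available.
    if len(target_tokens) > len(draft_tokens):
        output.append(target_tokens[len(draft_tokens)])
    return output, accepted


draft = [100, 57, 201, 12]
target = [100, 58, 205, 12, 30]
strict_out, strict_len = verify_draft(draft, target, relax_threshold=0)
relaxed_out, relaxed_len = verify_draft(draft, target, relax_threshold=1)
```

In this toy input the strict check accepts one draft token while a threshold of 1 accepts two, which is the sense in which relaxation lengthens the acceptance length. How the threshold should be chosen, and how much deviation a task tolerates, is not specified in this summary.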

Implications and Limitations

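• Implications:
  ◦ Presents Spec-VLA, an effective SD framework for speeding up VLA models.
  ◦ Validates the usefulness of an acceptance relaxation mechanism based on the relative distance between action tokens.
  ◦ Achieves a 1.42x speedup and a 44% longer acceptance length compared with the OpenVLA baseline.
  ◦ Shows that speculative execution is applicable to VLA prediction.
• Limitations:
  ◦ The greedy decoding mechanism of VLA models may limit the benefit of SD.
  ◦ Further study is needed on how well the proposed acceptance relaxation mechanism generalizes.
  ◦ Spec-VLA still needs to be evaluated on other VLA models and on more complex tasks.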
