This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Created by
Haebom
Authors
Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
Outline
This paper presents MTalk-Bench, a new benchmark for evaluating multi-turn speech-to-speech (S2S) large language models (LLMs). MTalk-Bench consists of nine realistic scenarios spanning three core dimensions (semantic information, vocal information, and ambient noise), together with targeted tasks designed to assess specific abilities such as reasoning. Evaluation combines arena-style (pairwise comparison) and rubric-based (absolute scoring) protocols, providing both relative and absolute assessments, with both human raters and LLMs serving as evaluators of model and human outputs. Experimental results show that S2S LLMs excel at processing semantic information but struggle to recognize vocal information and ambient noise. They also tend to lengthen their responses to restore consistency, at the cost of efficiency. Furthermore, modality-aware and task-specific designs outperform simple scaling. Finally, the paper analyzes the reliability and limitations of the proposed evaluation framework.
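To make the two evaluation protocols concrete, the sketch below shows one way arena-style pairwise judgments (relative ranking) and rubric-based absolute scores could be aggregated. This is a minimal illustration only: the function names, the Elo update rule with its default constants, the sample data, and the rubric dimension names are assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

# Arena-style protocol: aggregate pairwise judgments ("A", "B", or "tie")
# into per-model Elo ratings. The K-factor and initial rating are
# illustrative defaults, not values taken from the paper.
def elo_ratings(pairwise_results, k=32, initial=1000.0):
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in pairwise_results:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Rubric-based protocol: average absolute scores per model across
# rubric dimensions (dimension names here are hypothetical).
def rubric_scores(rubric_results):
    totals, counts = defaultdict(float), defaultdict(int)
    for model, scores in rubric_results:
        totals[model] += sum(scores.values()) / len(scores)
        counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}

if __name__ == "__main__":
    pairwise = [("model_x", "model_y", "A"),
                ("model_x", "model_y", "tie"),
                ("model_y", "model_x", "B")]
    rubrics = [("model_x", {"semantic": 4, "vocal": 3, "ambient": 2}),
               ("model_y", {"semantic": 3, "vocal": 2, "ambient": 2})]
    print(elo_ratings(pairwise))   # relative ranking from pairwise comparisons
    print(rubric_scores(rubrics))  # absolute scores from rubric ratings
```

Used together, the two views are complementary: the arena ranking shows which model wins head-to-head, while the rubric averages indicate how far each model is from the scoring ceiling on each dimension.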
Takeaways, Limitations
•
Takeaways:
◦
Introduces MTalk-Bench, a new benchmark for evaluating multi-turn S2S LLMs.
◦
S2S LLMs process semantic information well but struggle to recognize vocal information and ambient noise.
◦
Increasing response length helps restore consistency but reduces efficiency.
◦
Emphasizes the importance of modality-aware and task-specific design over simple scaling.
◦
Shows that the arena-style and rubric-based evaluation methods can complement each other.
•
Limitations:
◦
The arena and rubric methods yield consistent results only when performance differences are large.
◦
LLM evaluators agree with human raters only when differences are clear or the evaluation criteria are explicit.
◦
LLM evaluators exhibit position and length biases, and their assessments of nonverbal cues are reliable only when accompanied by textual annotations.
◦
Raises the need for a more robust, speech-aware evaluation framework.