Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

Created by
  • Haebom

Authors

Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin


Outline

This paper introduces MCPVerse, a benchmark for evaluating how large language models (LLMs), as they evolve from text generators into autonomous agents, use external tools. MCPVerse integrates more than 550 real-world tools, yields an action space exceeding 140,000 tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. Benchmarking state-of-the-art LLMs in three modes (Oracle, Standard, and Max-Scale) reveals that while most models degrade when faced with larger tool sets, agentic models such as Claude-4-Sonnet can leverage the expanded search space to improve accuracy.
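To make the outcome-based evaluation idea concrete, below is a minimal sketch of what such an evaluation loop could look like. This is an illustration only: the Tool, Task, evaluate, and stub_agent names are hypothetical and not MCPVerse's actual API, and the paper's real tool registry and live ground-truth checks are not reproduced here.

```python
# Minimal sketch of an outcome-based tool-use evaluation loop, in the spirit of
# MCPVerse. All names here (Tool, Task, evaluate, stub_agent) are hypothetical
# illustrations, not the benchmark's actual API.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    call: Callable[[], str]

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # outcome check, possibly against live ground truth

def evaluate(agent: Callable[[str, list[Tool]], str],
             tasks: list[Task], tools: list[Tool]) -> float:
    """Score an agent by final outcomes, not by which tools it called."""
    passed = 0
    for task in tasks:
        answer = agent(task.prompt, tools)  # the agent may invoke any subset of tools
        passed += task.check(answer)        # outcome-based: only the final answer matters
    return passed / len(tasks)

# A trivial time-sensitive tool and task; a real agent would be an LLM choosing
# among 550+ tools rather than this stub.
clock = Tool("clock", "Returns the current UTC date.",
             lambda: datetime.now(timezone.utc).strftime("%Y-%m-%d"))
tasks = [Task("What is today's date (UTC)?", lambda ans: ans == clock.call())]

def stub_agent(prompt: str, tools: list[Tool]) -> str:
    return tools[0].call()

print(f"accuracy = {evaluate(stub_agent, tasks, [clock]):.2f}")
```

In this framing, the three evaluation modes would plausibly differ mainly in the tools argument passed to the agent: Oracle exposing only the tools a task requires, Standard a moderate set, and Max-Scale the full 550+ tool registry, which is what stresses the large action space.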

Takeaways, Limitations

Takeaways:
  • Presents a new benchmark for assessing agentic tool use in LLMs with real, executable tools.
  • Evaluates LLM performance on large tool sets and in complex environments.
  • Shows that certain agentic models, such as Claude-4-Sonnet, improve when the tool set is expanded.
  • Establishes a foundation for future research on agentic tool use.
Limitations:
  • The paper itself does not state explicit limitations. (This summary is based solely on the paper's abstract.)