Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Created by
  • Haebom

Author

Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal

Outline

MEXA is a training-free framework that combines pre-trained expert models to perform scalable multimodal reasoning across diverse input modalities and complex tasks. To reason effectively in diverse domains such as medical diagnosis and financial forecasting, MEXA dynamically selects expert models according to the input modality and the task's specific reasoning requirements. Each expert specializes in a particular modality-task pair and produces interpretable, text-based reasoning output. MEXA then aggregates these outputs with a large reasoning model (LRM), which reasons over them to produce the final answer. This modular design enables flexible, transparent multimodal reasoning across domains without any additional training. On a wide range of multimodal benchmarks, including video reasoning, audio reasoning, 3D understanding, and medical QA, MEXA consistently outperforms strong multimodal baselines.
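The selection-then-aggregation flow described above can be sketched in a few lines. This is a minimal illustration only, assuming a simple registry of modality-task experts; the function names, the expert stubs, and the stand-in for the LRM call are all hypothetical, not MEXA's actual implementation.

```python
# Hypothetical sketch of a MEXA-style training-free pipeline.
# Experts are stubbed as plain functions that return text-based reasoning.

# Registry mapping (modality, task) pairs to expert callables.
EXPERTS = {
    ("video", "temporal_reasoning"): lambda x: f"video expert: observed {x}",
    ("audio", "sound_reasoning"):    lambda x: f"audio expert: heard {x}",
    ("3d",    "spatial_reasoning"):  lambda x: f"3d expert: layout of {x}",
}

def select_experts(modalities, task):
    """Dynamically pick experts whose modality matches the input
    and whose specialty matches the task's reasoning requirement."""
    return [fn for (mod, t), fn in EXPERTS.items()
            if mod in modalities and t == task]

def aggregate_with_lrm(expert_outputs, question):
    """Stand-in for the large reasoning model (LRM): in MEXA this would be
    a model call that reasons over the experts' text outputs to an answer."""
    context = "\n".join(expert_outputs)
    return f"Answer to '{question}' based on:\n{context}"

# Usage: a video question routed to matching experts, with no training step.
outputs = [fn("a ball rolling off a table")
           for fn in select_experts({"video"}, "temporal_reasoning")]
print(aggregate_with_lrm(outputs, "What happens next?"))
```

Because every expert emits plain text, the intermediate reasoning stays inspectable, which is what gives the design its transparency.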

Takeaways, Limitations

  • A training-free framework that handles diverse multimodal tasks efficiently.
  • Improves accuracy by dynamically selecting expert models based on input modality and task-specific reasoning requirements.
  • Provides a transparent reasoning process by generating interpretable, text-based reasoning outputs.
  • Outperforms existing models across a variety of multimodal benchmarks.
  • Applicable to diverse domains such as medical diagnosis and financial forecasting.
  • Depends on the underlying expert models; their quality directly affects MEXA's overall performance.
  • Results may vary with the capability and interpretability of the large reasoning model (LRM).