Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Created by
  • Haebom

Author

Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal

Outline

MEXA is a training-free framework that combines pre-trained expert models to perform scalable multimodal reasoning across diverse input modalities and complex tasks. To reason effectively in diverse domains such as medical diagnosis and financial forecasting, MEXA dynamically selects expert models according to the input modality and the task's specific reasoning requirements. Each expert specializes in a particular modality-task pair and produces interpretable, text-based reasoning output. MEXA then aggregates these outputs with a large reasoning model (LRM), which reasons over them to produce the final answer. This modular design enables flexible, transparent multimodal reasoning across domains without any additional training. On a wide range of multimodal benchmarks, including video reasoning, audio reasoning, 3D understanding, and medical QA, MEXA consistently outperforms strong multimodal baselines.
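The selection-then-aggregation flow described above can be sketched in a few lines. This is a minimal illustration only, assuming a simple registry of modality-task experts; the function names, the expert stubs, and the stand-in for the LRM call are all hypothetical, not MEXA's actual implementation.

```python
# Hypothetical sketch of a MEXA-style training-free pipeline.
# Experts are stubbed as plain functions that return text-based reasoning.

# Registry mapping (modality, task) pairs to expert callables.
EXPERTS = {
    ("video", "temporal_reasoning"): lambda x: f"video expert: observed {x}",
    ("audio", "sound_reasoning"):    lambda x: f"audio expert: heard {x}",
    ("3d",    "spatial_reasoning"):  lambda x: f"3d expert: layout of {x}",
}

def select_experts(modalities, task):
    """Dynamically pick experts whose modality matches the input
    and whose specialty matches the task's reasoning requirement."""
    return [fn for (mod, t), fn in EXPERTS.items()
            if mod in modalities and t == task]

def aggregate_with_lrm(expert_outputs, question):
    """Stand-in for the large reasoning model (LRM): in MEXA this would be
    a model call that reasons over the experts' text outputs to an answer."""
    context = "\n".join(expert_outputs)
    return f"Answer to '{question}' based on:\n{context}"

# Usage: a video question routed to matching experts, with no training step.
outputs = [fn("a ball rolling off a table")
           for fn in select_experts({"video"}, "temporal_reasoning")]
print(aggregate_with_lrm(outputs, "What happens next?"))
```

Because every expert emits plain text, the intermediate reasoning stays inspectable, which is what gives the design its transparency.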

Takeaways, Limitations

  • A training-free framework that handles diverse multimodal tasks efficiently.
  • Improves accuracy by dynamically selecting expert models based on input modality and task-specific reasoning requirements.
  • Provides a transparent reasoning process by generating interpretable, text-based reasoning outputs.
  • Outperforms existing models across a variety of multimodal benchmarks.
  • Applicable to diverse domains such as medical diagnosis and financial forecasting.
  • Depends on the underlying expert models; their quality directly affects MEXA's overall performance.
  • Results may vary with the capability and interpretability of the large reasoning model (LRM).