Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions. When sharing, please cite the source.

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Created by
  • Haebom

Authors

Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan

Outline

Vision language models (VLMs) are increasingly used as controllers that call external tools for complex reasoning and decision-making, but their effectiveness is limited by the lack of high-quality multimodal trajectories and by the cost of manual annotation. To address this, the paper proposes a vision-driven agent tuning framework that automatically synthesizes multimodal trajectories, generates step-by-step preference pairs, and trains a VLM controller for robust tool-use reasoning. The pipeline first builds M-TRACE, a large-scale dataset of 28,500 multimodal tasks with 177,000 validated trajectories, which enables imitation-based trajectory tuning. On top of this dataset, the authors develop MATRIX Agent, a controller fine-tuned on M-TRACE for step-by-step tool reasoning. For more precise alignment, they introduce Pref-X, a set of 11,000 automatically generated preference pairs, and further optimize MATRIX through step-by-step preference learning. On three benchmarks (Agent-X, GTA, and GAIA), MATRIX consistently outperforms both open-source and closed-source VLMs, demonstrating scalable and effective multimodal tool use.
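To make the idea of learning from preference pairs more concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not say which objective MATRIX uses, so the Direct Preference Optimization (DPO) loss below is a stand-in for one common way to train on chosen/rejected pairs. The `StepPreferencePair` schema, its field names, and the `beta` value are assumptions, not details from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class StepPreferencePair:
    """One step-level preference pair (hypothetical schema, not from the paper)."""
    context: str   # task description plus the trajectory so far
    chosen: str    # preferred next step, e.g. a tool call
    rejected: str  # dispreferred next step for the same context


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen step
    over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Usage: the log-probabilities would come from scoring `chosen` and `rejected`
# under the policy and reference models; the numbers here are made up.
pair = StepPreferencePair(
    context="Task: count the cars in the image. Steps so far: none.",
    chosen="call object_detector(image)",
    rejected="answer directly without looking at the image",
)
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the chosen step than the reference model does, which is the behavior any step-level preference objective of this kind aims for.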

Takeaways, Limitations

Takeaways:
Automatically synthesizing multimodal trajectories and step-by-step preference pairs improves the performance of VLM controllers.
M-TRACE, MATRIX Agent, and Pref-X together form an efficient tuning pipeline.
MATRIX outperforms existing open-source and closed-source VLMs on the Agent-X, GTA, and GAIA benchmarks.
The results demonstrate scalable and effective multimodal tool use.
Limitations:
The abstract does not state specific limitations, so they cannot be determined from it alone.
Further research may be needed to determine the generalization ability of the method proposed in this paper.
Additional experiments may be needed to evaluate applicability in real-world environments.