Vision Language Models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. To address this, we propose a vision-driven agent tuning framework that automatically synthesizes multimodal trajectories, generates step-by-step preference pairs, and trains a VLM controller for robust tool-use reasoning. The pipeline first builds M-TRACE, a large-scale dataset of 28,500 multimodal tasks with 177,000 validated trajectories, enabling imitation-based trajectory tuning. Building on this dataset, we develop MATRIX Agent, a controller fine-tuned on M-TRACE for step-by-step tool reasoning. To achieve finer alignment, we introduce Pref-X, a set of 11,000 automatically generated preference pairs, and further optimize MATRIX through step-by-step preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently outperforms both open-source and closed-source VLMs, demonstrating scalable and effective multimodal tool use.