Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries here are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Created by
  • Haebom

Authors

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao

Overview

This paper addresses the limitations of current approaches that combine image-based tools with reinforcement learning to solve visual problems with large multimodal models. Existing open-source approaches show monotonous reasoning patterns and allow only a limited number of interaction turns, which makes them inadequate for difficult tasks that require trial-and-error exploration. To address this, the authors scale up tool-based interaction and present Mini-o3, a system that performs deep, multi-turn reasoning spanning tens of steps and achieves state-of-the-art performance on challenging visual search tasks. Their recipe for reproducing OpenAI o3-style behavior has three key components. First, they construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, they develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, they propose an over-turn masking strategy that avoids penalizing over-turn responses (those that reach the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Although the model is trained with an upper limit of only six interaction turns, it generates trajectories that naturally scale to tens of turns at inference time, and accuracy improves as the number of turns increases. Extensive experiments show that Mini-o3 produces rich reasoning patterns and deep thinking paths, and that it effectively solves challenging visual search problems.
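
The over-turn masking idea can be illustrated with a small sketch. The summary does not give the paper's exact formulation, so everything below is an assumption made for illustration: the group-normalized advantage, the binary rewards, and the names `masked_advantages`, `policy_loss`, and `hit_turn_limit` are all hypothetical. The only point it tries to show is that a trajectory that runs out of turns is dropped from the loss instead of being treated as a failure.

```python
from statistics import mean, pstdev


def masked_advantages(rewards, hit_turn_limit, eps=1e-6):
    """Return (advantages, loss_mask) for one group of sampled trajectories.

    rewards: one scalar reward per trajectory (hypothetical: 1.0 = correct).
    hit_turn_limit: True where the trajectory used up the turn budget
        without giving a final answer (an "over-turn" response).
    """
    # Group baseline and spread, as in group-normalized policy gradients.
    # Assumption: statistics are taken over every trajectory in the group.
    baseline = mean(rewards)
    spread = pstdev(rewards)

    advantages, loss_mask = [], []
    for reward, over_turn in zip(rewards, hit_turn_limit):
        if over_turn:
            # Masked: contributes nothing to the policy loss, so the
            # unfinished trajectory is neither rewarded nor penalized.
            advantages.append(0.0)
            loss_mask.append(0.0)
        else:
            advantages.append((reward - baseline) / (spread + eps))
            loss_mask.append(1.0)
    return advantages, loss_mask


def policy_loss(log_probs, advantages, loss_mask):
    """Masked policy-gradient surrogate averaged over unmasked trajectories."""
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0
    total = sum(-lp * adv * m
                for lp, adv, m in zip(log_probs, advantages, loss_mask))
    return total / kept


# Four sampled trajectories; the second one ran out of turns.
rewards = [1.0, 0.0, 0.0, 1.0]
hit_turn_limit = [False, True, False, False]
advantages, loss_mask = masked_advantages(rewards, hit_turn_limit)
print(advantages[1], loss_mask)  # the over-turn trajectory gets advantage 0.0
```

Without the mask, the second trajectory would receive the same negative advantage as the third (a finished but wrong answer), which would push the policy toward answering early. With the mask, only finished trajectories shape the update, which is one plausible reading of why a six-turn training budget need not discourage longer trajectories at test time.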

Takeaways, Limitations

Takeaways:
Mini-o3, a new system that achieves state-of-the-art performance on demanding visual search tasks.
Deep, multi-turn reasoning that exhibits varied reasoning patterns (depth-first search, trial-and-error, goal maintenance, etc.).
Test-time scaling: although training is limited to six interaction turns, the model extends to tens of turns at inference, and accuracy improves as the number of turns increases.
The Visual Probe Dataset, a new dataset of challenging visual search problems for exploratory reasoning.
An over-turn masking strategy that balances training-time efficiency with test-time scalability in reinforcement learning.
Limitations:
Further validation of the scale and generalization performance of the Visual Probe Dataset is needed.
Mini-o3's performance may be biased towards certain types of visual search problems.
There is a need to evaluate generalization performance for other types of visual problems or across different modalities.
Further research is needed on optimizing the over-turn masking strategy and on how well it generalizes.