Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Created by
  • Haebom

Author

Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng

Outline

This paper aims to improve the ability of multimodal large language models (MLLMs) to understand and reason over contexts in which multiple images are interleaved with text. To this end, the authors propose a new benchmark, Multi-Image Interleaved Reasoning (MIR), in which text is processed jointly with the multiple images it refers to. MIR requires models to accurately ground image regions in the corresponding text and to logically connect information across images. To further improve MLLM performance, the authors annotate each instance with reasoning steps and introduce a stage-wise curriculum learning strategy that progresses from easy to hard examples. Experiments show that the proposed method significantly improves reasoning performance on MIR as well as on other benchmarks.
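The easy-to-hard curriculum described above can be sketched roughly as follows. This is a minimal illustration only: the `difficulty` scores and the cumulative staging scheme are assumptions for the example, not the paper's actual procedure.

```python
# Hypothetical sketch of an easy-to-hard curriculum schedule.
# Assumes each training instance carries a precomputed "difficulty" score;
# how MIR actually measures difficulty is not specified in this summary.

def curriculum_stages(instances, num_stages=3):
    """Sort instances by difficulty, then expose them in cumulative stages.

    Each stage adds the next-harder slice, so later stages train on
    everything seen so far plus harder examples (easy -> hard).
    """
    ordered = sorted(instances, key=lambda x: x["difficulty"])
    stage_size = (len(ordered) + num_stages - 1) // num_stages  # ceiling division
    return [ordered[: stage_size * (s + 1)] for s in range(num_stages)]


data = [
    {"id": "q1", "difficulty": 0.9},
    {"id": "q2", "difficulty": 0.1},
    {"id": "q3", "difficulty": 0.5},
]
stages = curriculum_stages(data, num_stages=3)
# stages[0] holds only the easiest instance; stages[-1] holds all of them.
```

A cumulative schedule like this keeps earlier, easier examples in later stages, which is one common way to avoid catastrophic forgetting of the easy cases as harder ones are introduced.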

Takeaways, Limitations

Takeaways:
MIR, a new benchmark, evaluates and improves the ability of MLLMs to understand and reason over multiple images and interleaved text.
Per-instance reasoning steps and a stage-wise, easy-to-hard curriculum learning strategy are proposed to improve MLLM performance.
Experiments demonstrate the effectiveness of the proposed method, with performance gains on both MIR and existing benchmarks.
MIR can facilitate the development of MLLMs capable of handling complex cross-modal tasks.
Limitations:
The paper does not explicitly discuss its limitations. (This summary includes only information confirmed in the abstract.)