Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Created by
  • Haebom

Authors

Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao

Outline

This paper presents MMReason, a new benchmark for accurately and comprehensively evaluating the long-chain reasoning capability of multimodal large language models (MLLMs). It targets three major limitations of existing benchmarks: insufficient difficulty and diversity, vulnerability to guessing and memorization, and the lack of evaluation of intermediate reasoning steps. To address these, the authors curate complex multi-step questions spanning six academic fields and a wide range of difficulties (from pre-university to university level, and from beginner to competition level). The questions are reformulated into an open-ended format and filtered with a multi-model voting technique to eliminate shortcuts such as guessing and memorization. In addition, each question is annotated with a step-by-step solution, and a reference-based three-way scoring scheme is designed to reliably evaluate intermediate reasoning steps. Using MMReason, the authors benchmark leading MLLMs and provide an in-depth analysis of their reasoning capabilities.
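The summary does not specify the multi-model voting filter in detail; the following is a minimal Python sketch of how such a filter could work, assuming a question is discarded whenever enough models in a panel can recover the reference answer from the question text alone (a sign of guessing or memorization). The names `models`, `answers_match`, and `filter_questions` are illustrative stand-ins, not the paper's actual code.

```python
from typing import Callable, Dict, List

def is_shortcut_solvable(question: str,
                         reference_answer: str,
                         models: List[Callable[[str], str]],
                         answers_match: Callable[[str, str], bool],
                         threshold: int = 1) -> bool:
    """Return True if at least `threshold` models recover the reference
    answer from the question text alone, suggesting the question can be
    answered by guessing or memorization rather than reasoning."""
    votes = sum(
        1 for model in models
        if answers_match(model(question), reference_answer)
    )
    return votes >= threshold

def filter_questions(dataset: List[Dict[str, str]],
                     models: List[Callable[[str], str]],
                     answers_match: Callable[[str, str], bool]) -> List[Dict[str, str]]:
    """Keep only the questions that no model can shortcut."""
    return [
        item for item in dataset
        if not is_shortcut_solvable(item["question"], item["answer"],
                                    models, answers_match)
    ]
```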
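Likewise, here is a hedged sketch of the reference-based three-way scoring of intermediate steps, assuming the scale maps to correct / partially correct / incorrect and that a model's predicted steps can be aligned one-to-one with the annotated reference solution. The `judge` function is a hypothetical stand-in (e.g., an LLM-as-judge call), not the paper's implementation.

```python
from enum import Enum
from typing import Callable, List

class StepScore(Enum):
    # Assumed mapping of the three-way scale to numeric credit.
    CORRECT = 1.0
    PARTIAL = 0.5
    INCORRECT = 0.0

def score_reasoning_chain(predicted_steps: List[str],
                          reference_steps: List[str],
                          judge: Callable[[str, str], StepScore]) -> float:
    """Score each predicted step against the corresponding annotated
    reference step; the chain score is the mean step credit."""
    scores = [
        judge(pred, ref).value
        for pred, ref in zip(predicted_steps, reference_steps)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring each intermediate step, rather than only the final answer, is what lets the benchmark distinguish a model that reasons correctly but slips at the end from one that guesses the right answer outright.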

Takeaways, Limitations

Takeaways:
  • MMReason is a new benchmark for accurately and comprehensively evaluating the long-chain reasoning capability of MLLMs.
  • Questions spanning diverse fields and difficulty levels enable a comprehensive evaluation of MLLMs' reasoning ability.
  • The methodology reduces the possibility of guessing and memorization and evaluates intermediate reasoning steps.
  • It provides an important research resource for improving the reasoning ability of MLLMs.
Limitations:
  • MMReason's questions may be biased toward certain fields and difficulty levels.
  • Despite filtering with the multi-model voting technique, guessing and memorization cannot be completely ruled out.
  • The objectivity and reliability of the reference-based three-way scoring scheme may need further validation.