Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Created by
  • Haebom

Author

Jingqi Zhou, Sheng Wang, Jingwei Dong, Kai Liu, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, Chuan Wu

Outline

Large Vision-Language Models (LVLMs) have made significant progress on visual understanding tasks, but their performance degrades on visual reasoning tasks because they prioritize language knowledge over image information. To address this, the authors first identify the shortcomings of existing solutions: limited multimodal reasoning capability and insufficient or irrelevant visual descriptions. They then introduce ProReason, a novel visual reasoning framework that divides the visual reasoning process into two stages: proactive visual perception (Eyesight) and textual reasoning (Wisdom). The framework features decoupled vision and reasoning capabilities together with multi-step proactive perception: ProReason iterates between proactive information gathering and reasoning until it can answer a given multimodal question with sufficient visual descriptions. Notably, this decoupling of capabilities enables seamless integration with existing Large Language Models (LLMs), compensating for the reasoning deficits of LVLMs. Extensive experiments show that ProReason outperforms existing multi-step reasoning frameworks across a variety of benchmarks, with an average performance improvement of 13.2%. Furthermore, through LLM integration, ProReason generates high-quality visual reasoning data, enabling the distilled models ProReason-VL and ProReason-Q3 to achieve superior performance on downstream tasks. The insights into existing solutions and the decoupled perspective on feasible LLM integration should inform future research on visual reasoning, particularly LLM-assisted approaches.
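The decoupled iterate-until-sufficient loop described above can be sketched roughly as follows. This is a minimal toy illustration, not the paper's implementation: the agent names (`eyesight`, `wisdom_can_answer`, `wisdom_answer`), the hard-coded observations, and the stopping criterion are all hypothetical stand-ins for the actual LVLM/LLM prompts.

```python
def eyesight(question, image, memory):
    """Stand-in for the visual perception agent (an LVLM): returns one
    new textual observation about the image relevant to the question."""
    facts = ["the y-axis of the chart is logarithmic",
             "the blue line crosses the red line at x = 3"]
    for fact in facts:
        if fact not in memory:
            return fact
    return None  # nothing new left to observe

def wisdom_can_answer(question, memory):
    """Stand-in for the reasoning agent's sufficiency check: decides
    whether the collected observations suffice to answer."""
    return len(memory) >= 2  # toy criterion

def wisdom_answer(question, memory):
    """Stand-in for the reasoning LLM: answers from text-only memory,
    never looking at the image itself."""
    return f"Answer derived from {len(memory)} observations."

def proreason(question, image, max_steps=5):
    memory = []  # shared textual memory of visual observations
    for _ in range(max_steps):
        if wisdom_can_answer(question, memory):
            break  # sufficient visual descriptions gathered
        observation = eyesight(question, image, memory)
        if observation is None:
            break  # perception agent has nothing new to add
        memory.append(observation)
    return wisdom_answer(question, memory)

print(proreason("Where do the lines cross?", image=None))
```

Because the reasoning agent only ever sees the textual memory, it can be swapped for any strong text-only LLM, which is the property the paper exploits for LLM integration and distillation.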

Takeaways, Limitations

The ProReason framework decouples proactive visual perception from textual reasoning to enhance the visual reasoning capabilities of LVLMs.
Integration with LLMs further improves performance and yields high-quality visual reasoning data for distillation.
It outperforms existing frameworks across various benchmarks (13.2% average improvement).
It suggests future research directions for LLM-assisted visual reasoning techniques.
No specific limitations are discussed in the paper.