This page curates papers on artificial intelligence published around the world. Summaries are generated with Google Gemini, and the site is operated on a non-profit basis. Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.
More Thoughts, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Created by
Haebom
Author
Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
Outline
This paper examines what happens when the reasoning capabilities of large language models (LLMs) are extended to vision-language models (VLMs). The authors identify "visual forgetting": as reasoning chains grow longer, the model relies increasingly on text and drifts away from the visual input, impairing visual perception. To counteract this, they propose Vision-Anchored Policy Optimization (VAPO), which steers the reasoning process toward greater reliance on visual information and achieves new state-of-the-art results on a range of benchmarks.
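The anchoring idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes VAPO-style training adds a visual-reliance bonus (e.g., attention mass on image tokens) to the task reward inside a group-normalized policy-gradient objective; the function names, the `lam` weight, and the scoring signal are all hypothetical.

```python
import math

def vapo_advantages(task_rewards, visual_scores, lam=0.5):
    """Combine each sampled trace's task reward with a hypothetical
    visual-anchoring bonus, then normalize within the sampled group
    (a GRPO-style baseline). `visual_scores` stands in for any measure
    of how much the trace attended to the image."""
    combined = [r + lam * v for r, v in zip(task_rewards, visual_scores)]
    mean = sum(combined) / len(combined)
    std = math.sqrt(sum((c - mean) ** 2 for c in combined) / len(combined)) or 1.0
    return [(c - mean) / std for c in combined]

def policy_gradient_loss(logprobs, advantages):
    """REINFORCE-style surrogate: traces with higher combined reward
    (task success + visual reliance) get their log-probability pushed up."""
    return -sum(a * lp for a, lp in zip(advantages, logprobs))
```

Under this sketch, a trace that answers correctly while attending to the image outscores one that answers correctly from text priors alone, so the update pressure counteracts visual forgetting.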
Takeaways, Limitations
•
Takeaways:
◦
The paper highlights that improving a VLM's reasoning ability can degrade its visual perception, and proposes a new methodology to address this trade-off.
◦
VAPO improves VLM performance by anchoring the reasoning process more firmly to visual information.
◦
It achieves new state-of-the-art performance across a variety of visual tasks.
•
Limitations:
◦
No specific limitations are discussed in the paper.