This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
Created by
Haebom
Author
William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Outline
This paper investigates the mathematical problem-solving capabilities of multimodal large language models (MLLMs), focusing on their limitations in geometric reasoning. Evaluating shape recognition and multi-step reasoning across a range of MLLMs, the authors find that accuracy in identifying regular polygons falls below 50%. They attribute this failure to the models' reliance on intuitive associations (System 1) and their inability to engage in deliberate reasoning (System 2). To address this, the paper proposes Visually Cued Chain-of-Thought (VC-CoT) prompting, which explicitly references visual annotations drawn on the shapes; this raises GPT-4o's accuracy on counting the sides of irregular polygons from 7% to 93%. The results underscore the importance of visually guided prompting for eliciting System 2 reasoning in MLLMs.
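To make the idea concrete, below is a minimal sketch of the VC-CoT prompting pattern described above: instead of asking for the side count directly, the prompt points the model at vertex labels annotated on the image and has it enumerate them before answering. The prompt wording, function names, and vertex labels are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of Visually Cued Chain-of-Thought (VC-CoT) prompting.
# Assumption: the shape in the image has been annotated with vertex labels
# (e.g., A-G); the exact wording is illustrative, not the paper's prompt.

def direct_prompt() -> str:
    """Baseline, System-1-style query: asks for the answer in one step."""
    return "How many sides does the shape in the image have? Answer with a number."

def vc_cot_prompt(vertex_labels: list[str]) -> str:
    """VC-CoT-style query: walks the model through the visual annotations
    (vertex labels drawn on the shape) before it commits to a count."""
    labels = ", ".join(vertex_labels)
    return (
        f"The vertices of the shape in the image are labeled {labels}. "
        "Step 1: List every labeled vertex you can see. "
        "Step 2: Count the vertices you listed. "
        "Step 3: The number of sides equals the number of vertices. "
        "Report the final side count."
    )

if __name__ == "__main__":
    # Hypothetical annotation for a 7-sided irregular polygon.
    print(direct_prompt())
    print(vc_cot_prompt(["A", "B", "C", "D", "E", "F", "G"]))
```

Either prompt would be sent to the model together with the annotated image; the key design choice is that the chain of thought is anchored to cues actually visible in the image rather than to the model's prior associations about shape names.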
Takeaways, Limitations
•
Takeaways:
◦
Reveals the limitations of MLLMs' visual-mathematical reasoning abilities.
◦
Shows that MLLMs have difficulty processing visual information and connecting it to learned concepts.
◦
Demonstrates that the VC-CoT prompting technique can substantially improve visual reasoning performance.
◦
Emphasizes the importance of prompting strategies that effectively utilize visual information.
•
Limitations:
◦
The evaluation may cover only a limited range of geometric problem types.
◦
Further research is needed to determine the generalizability of the VC-CoT prompting technique.
◦
Does not offer a fundamental solution for improving System 2 reasoning ability.
◦
Lacks testing of generalizability across a wider range of MLLMs.