
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Created by
  • Haebom

Author

Shmuel Berman, Jia Deng

Outline

In this paper, we present a novel evaluation suite for assessing the nonlocal visual reasoning abilities of vision-language models (VLMs). Nonlocal visual reasoning refers to reasoning that connects evidence gathered from multiple regions of an image, and we classify it into three forms: comparative perception, saccadic search, and smooth visual search. Experiments on state-of-the-art VLMs, including Gemini 2.5 Pro, Claude Vision 3.7, and GPT-o4-mini, show that these models barely exceed chance accuracy on tasks that are trivial for humans. This suggests that although VLMs perform well on raw visual acuity benchmarks, they lack core visual reasoning capabilities. The study provides a structured evaluation suite for testing whether VLMs can execute human-like vision algorithms.
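
To make the evaluation protocol concrete, here is a minimal sketch of how such a benchmark might be scored: each item is a multiple-choice question from one of the three task families, and per-family model accuracy is compared against per-item chance accuracy. All names here (query_vlm, TASKS, the file names) are hypothetical illustrations, not the authors' actual code or data.

```python
# Minimal sketch of scoring a nonlocal-visual-reasoning benchmark.
# query_vlm and TASKS are hypothetical stand-ins, not the authors' code.
import random
from collections import defaultdict

# Each task pairs an image with a multiple-choice question drawn from one of
# the three nonlocal-reasoning families described above.
TASKS = [
    {"family": "comparative_perception", "image": "pair_001.png",
     "question": "Are the two marked shapes identical?",
     "choices": ["yes", "no"], "answer": "no"},
    {"family": "saccadic_search", "image": "grid_014.png",
     "question": "Which cell contains the object matching the cue?",
     "choices": ["A", "B", "C", "D"], "answer": "C"},
    {"family": "smooth_visual_search", "image": "maze_007.png",
     "question": "Which exit does the traced path reach?",
     "choices": ["left", "right"], "answer": "left"},
]

def query_vlm(image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder for a real VLM API call; here it guesses uniformly at random."""
    return random.choice(choices)

correct = defaultdict(int)
total = defaultdict(int)
chance = defaultdict(list)

for task in TASKS:
    pred = query_vlm(task["image"], task["question"], task["choices"])
    fam = task["family"]
    total[fam] += 1
    correct[fam] += int(pred == task["answer"])
    chance[fam].append(1.0 / len(task["choices"]))  # chance accuracy for this item

# A model with "tunnel vision" would land near the chance baseline per family.
for fam in total:
    acc = correct[fam] / total[fam]
    baseline = sum(chance[fam]) / len(chance[fam])
    print(f"{fam}: accuracy={acc:.2f}, chance={baseline:.2f}")
```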

Takeaways, Limitations

Takeaways: The study demonstrates that current state-of-the-art VLMs have serious limitations in nonlocal visual reasoning: even on simple visual tasks, they fall far short of human-level performance. This carries important implications for the future development of VLMs, and the evaluation suite presented here can serve as a useful tool for objectively measuring their visual reasoning abilities.
Limitations: Because the study focuses on specific types of nonlocal visual reasoning tasks, it cannot be said to comprehensively evaluate the overall visual reasoning ability of VLMs; the evaluation scope should be broadened with additional task types. Moreover, characteristics of the image dataset used for evaluation may influence the results.