Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in VLMs

Created by
  • Haebom

Author

Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

Outline

This paper proposes VISER (Visual Input Structure for Enhanced Reasoning) to address limitations in the visual reasoning capabilities of vision-language models (VLMs). VLMs struggle to reliably bind perceptual features to their visual referents (the binding problem), leading to errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. VISER is a simple yet effective method that augments the visual input with low-level spatial structure and adds a text prompt that encourages sequential, spatially aware parsing. Experiments show that VISER significantly improves performance across a range of visual reasoning tasks: on GPT-4o it improves visual search accuracy by 25.00% and counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and improves spatial relationship performance on a 2D synthetic dataset by 9.50%. These results highlight the importance of visual input design over purely linguistic approaches and suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning.
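
To make the idea concrete, here is a minimal sketch (in Python, using Pillow) of what this kind of low-level visual structuring could look like: overlaying a simple grid on the input image and pairing it with a prompt that asks the model to scan cells sequentially. The grid spacing, the prompt wording, and the helper name `add_grid_overlay` are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: grid spacing and prompt text are assumptions,
# not the authors' exact VISER implementation.
from PIL import Image, ImageDraw

def add_grid_overlay(image: Image.Image, cell_size: int = 64) -> Image.Image:
    """Overlay light grid lines to give the model low-level spatial anchors."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, cell_size):   # vertical grid lines
        draw.line([(x, 0), (x, h)], fill=(128, 128, 128), width=1)
    for y in range(0, h, cell_size):   # horizontal grid lines
        draw.line([(0, y), (w, y)], fill=(128, 128, 128), width=1)
    return out

# A text prompt nudging sequential, spatially aware parsing (hypothetical wording).
PROMPT = (
    "The image is divided into grid cells. Scan the cells left to right, "
    "top to bottom, note what each cell contains, and then answer: "
    "how many red circles are in the image?"
)

structured = add_grid_overlay(Image.open("scene.png"))  # hypothetical input file
structured.save("scene_grid.png")  # send this image with PROMPT in one request
```

Sending the structured image together with the prompt in one request matches the single-query setting noted in the takeaways below.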

Takeaways, Limitations

Takeaways:
Low-level visual structuring is an effective way to improve the visual reasoning ability of VLMs.
Visual input design matters more than purely language-based approaches for these tasks.
VISER is efficient, mitigating the binding problem with a single-query inference.
Performance improvements were achieved across a variety of visual reasoning tasks, including visual search, counting, scene description, and spatial relationship understanding.
Limitations:
Results are currently presented only for 2D synthetic datasets; further research is needed to determine generalizability to real-world datasets.
The computational cost and scalability of the proposed method are not analyzed.
Further research is needed to determine generalizability across different VLM architectures.