Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please be sure to credit the source when sharing.

Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

Created by
  • Haebom

Authors

Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang

Outline

This paper presents Jigsaw-Puzzles, a new benchmark for evaluating the spatial reasoning capabilities of vision-language models (VLMs). Jigsaw-Puzzles consists of 1,100 real-world images with high spatial complexity and includes five tasks assessing spatial perception, structural understanding, and reasoning. In an evaluation of 24 state-of-the-art VLMs, even the top-performing model, Gemini-2.5-Pro, achieved only 77.14% overall accuracy, and only about 30% accuracy on the sequence generation task, far below the over-90% performance of human participants. These results highlight the need for continued research to improve the spatial reasoning capabilities of VLMs.
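For illustration, here is a minimal sketch of how per-task and overall accuracy on a benchmark like this might be computed. The task names and the record format are hypothetical placeholders, not the paper's actual evaluation harness or task taxonomy.

```python
from collections import defaultdict

# Hypothetical evaluation records: (task, model_prediction, gold_answer).
# Task names are illustrative, not the paper's exact five-task taxonomy.
results = [
    ("spatial_perception", "B", "B"),
    ("structure_understanding", "A", "C"),
    ("sequence_generation", "2-4-1-3", "2-4-1-3"),
    ("sequence_generation", "1-2-3-4", "4-3-2-1"),
]

per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
for task, pred, gold in results:
    per_task[task][0] += int(pred == gold)
    per_task[task][1] += 1

# Per-task accuracy, then overall accuracy pooled across all items.
for task, (correct, total) in sorted(per_task.items()):
    print(f"{task}: {correct / total:.2%}")

overall = sum(c for c, _ in per_task.values()) / sum(t for _, t in per_task.values())
print(f"overall: {overall:.2%}")
```

Pooling items this way weights each task by its number of questions; a benchmark could instead average the per-task accuracies so that every task counts equally.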

Takeaways, Limitations

Takeaways:
Presents a new benchmark (Jigsaw-Puzzles) for objectively evaluating the spatial reasoning capabilities of VLMs.
Clearly demonstrates the limitations of the spatial reasoning capabilities of state-of-the-art VLMs.
Suggests directions for spatial reasoning research in VLMs, especially the need to improve performance on sequence generation tasks.
Limitations:
The Jigsaw-Puzzles dataset may be relatively small.
The types of evaluation tasks may be limited.
The benchmark may not fully reflect the variety of spatial situations found in the real world.