Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Created by
  • Haebom

Authors

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Outline

This paper introduces CAPTURe (Counting Amodally for Patterns Through Unseen Regions), a new task for evaluating whether a model can infer patterns that continue behind occluded regions. CAPTURe requires a model to count objects arranged in a pattern by extrapolating that pattern behind an occluder, testing both visual pattern recognition and reasoning. It comes in two versions: CAPTURe-real, built from images of real objects, and CAPTURe-synthetic, built from generated images. Evaluating four strong VLMs (GPT-4o, InternVL2, Molmo, and Qwen2-VL) shows that they count poorly on both occluded and unoccluded patterns, and that performance degrades further under occlusion, suggesting that VLMs struggle to infer unseen spatial relationships. Humans, by contrast, make very few errors on CAPTURe. Providing auxiliary information about the locations of occluded objects improves model performance, indicating that the errors stem both from an inability to handle occlusion and from difficulty counting in images.
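To make the task format concrete, here is a minimal sketch of a CAPTURe-synthetic-style setup: it draws a regular dot grid, paints a rectangular occluder over part of it, and scores a model's count against the true (amodal) count. This is an illustration of the idea only, not the authors' generation or evaluation code; `query_vlm`, the image parameters, and the relative-error metric are all assumptions.

```python
# Hypothetical sketch of the CAPTURe task format (not the authors' code).
from PIL import Image, ImageDraw

def make_occluded_grid(rows=4, cols=5, cell=60, occluder=(150, 70, 290, 200)):
    """Draw a rows x cols dot grid, then paint a gray box over part of it."""
    img = Image.new("RGB", (cols * cell, rows * cell), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows):
        for c in range(cols):
            x, y = c * cell + cell // 2, r * cell + cell // 2
            draw.ellipse((x - 10, y - 10, x + 10, y + 10), fill="red")
    # The occluder hides some dots; the regular pattern implies their presence.
    draw.rectangle(occluder, fill="gray")
    return img, rows * cols  # the true count includes the occluded dots

def count_error(predicted: int, true_count: int) -> float:
    """Absolute relative count error, a common metric for counting tasks."""
    return abs(predicted - true_count) / true_count

img, true_count = make_occluded_grid()
img.save("capture_style_example.png")
# predicted = query_vlm(img, "How many red dots are there, including any "
#                            "hidden behind the gray box?")  # placeholder VLM call
# print(count_error(predicted, true_count))
```

Comparing the same image with and without the gray box, as the paper does, isolates how much of the counting error is attributable to occlusion rather than to counting itself.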

Takeaways, Limitations

Takeaways:
CAPTURe provides a new benchmark for evaluating reasoning about occluded objects.
Current strong VLMs lack the ability to infer and spatially understand occluded objects.
The results suggest directions for improving VLMs: better inference over occluded information, visual pattern recognition, and reasoning.
The performance gap between humans and VLMs points to directions for future VLM development.
Limitations:
The CAPTURe dataset may be limited in size.
The set of VLMs evaluated may be limited.
The benchmark may not fully reflect the complexity of real-world visual scenes.
The improvement from providing auxiliary information shows that the models' error sources are multi-layered, but the paper lacks a quantitative analysis of each cause.