Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; when sharing, please cite the source.

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Created by
  • Haebom

Authors

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi

Outline

OmniSpatial is a comprehensive and challenging spatial reasoning benchmark grounded in cognitive psychology. It spans four major categories (dynamic reasoning, complex spatial logic, spatial interaction, and perspective taking) and 50 subcategories, comprising more than 8,400 question-answer pairs. The authors demonstrate experimentally that existing open- and closed-source VLMs show significant limitations in comprehensive spatial reasoning, and they explore two strategies to improve it: PointGraph (explicit scene-graph cues) and SpatialCoT (novel-view chain-of-thought).
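The summary does not include implementation details, but both strategies can be pictured as prompt-construction steps applied before querying a VLM. Below is a minimal, hypothetical Python sketch; the scene-graph format, function names, and prompt wording are all assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of PointGraph-style and SpatialCoT-style prompting.
# Nothing here comes from the OmniSpatial codebase; the scene-graph format
# and the prompt wording are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SceneEdge:
    subject: str   # e.g. "red mug"
    relation: str  # e.g. "left of"
    obj: str       # e.g. "laptop"

def pointgraph_prompt(question: str, edges: list[SceneEdge]) -> str:
    """Prepend explicit scene-graph cues (PointGraph-style) to the question."""
    graph_lines = [f"- the {e.subject} is {e.relation} the {e.obj}" for e in edges]
    return (
        "Scene graph (object relations in the image):\n"
        + "\n".join(graph_lines)
        + f"\n\nQuestion: {question}"
    )

def spatialcot_prompt(question: str, viewpoint: str) -> str:
    """Ask the model to reason step by step from a stated viewpoint
    (a SpatialCoT-style novel-view chain-of-thought prompt)."""
    return (
        f"Imagine the scene from the viewpoint of {viewpoint}. "
        "Describe where each object is from that viewpoint, "
        "then answer step by step.\n\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    q = "Is the mug to the speaker's left or right?"
    edges = [SceneEdge("red mug", "left of", "laptop")]
    print(pointgraph_prompt(q, edges))
    print()
    print(spatialcot_prompt(q, "the person facing the camera"))
```

In an actual evaluation, the scene-graph edges would come from a detector or from annotations, and the assembled text would be sent together with the image to the VLM under test.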

Takeaways and Limitations

Takeaways:
  • Presents OmniSpatial, a new benchmark that clearly exposes the limits of existing VLMs' spatial reasoning capabilities.
  • Proposes the PointGraph and SpatialCoT strategies for improving spatial reasoning.
  • Grounds a more comprehensive and complex set of spatial reasoning tasks in cognitive psychology.
Limitations:
  • OmniSpatial is still an early-stage benchmark; more diverse and complex spatial reasoning tasks may need to be added, and the benchmark may need further expansion.
  • Further research is needed on the generalization performance and efficiency of the proposed PointGraph and SpatialCoT strategies.