Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models

Created by
  • Haebom

Authors

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang

ViewSpatial-Bench: Evaluating Multi-Viewpoint Spatial Localization

Outline

While vision-language models (VLMs) have demonstrated the ability to understand and reason about visual content, they struggle with tasks that require cross-viewpoint understanding and spatial reasoning. Current VLMs excel mainly at egocentric spatial reasoning from the camera's perspective, but fail to generalize to allocentric viewpoints, i.e., when they must adopt another entity's spatial frame of reference. ViewSpatial-Bench is the first comprehensive benchmark designed to evaluate multi-viewpoint spatial localization. It comprises five task types and is supported by an automatic 3D annotation pipeline that generates accurate directional labels. A comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance gap: the models perform reasonably well on camera-perspective tasks, but their accuracy degrades when they must reason from a human viewpoint. Fine-tuning VLMs on a multi-perspective spatial dataset yields a 46.24% overall performance improvement across tasks, providing evidence that modeling 3D spatial relationships enhances the spatial understanding capabilities of VLMs.
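
The distinction above between egocentric (camera) and allocentric (another observer's) frames is the core of the benchmark, and the summary mentions an automatic 3D annotation pipeline that derives directional labels. The paper's actual pipeline is not reproduced here; the sketch below is a minimal, hypothetical illustration (the function name `direction_label` and the scene coordinates are invented for this example) of how a front/behind/left/right label for the same object can flip depending on whose reference frame is used.

```python
import numpy as np

def direction_label(observer_pos, observer_facing, target_pos):
    """Classify a target as front/behind/left/right relative to an
    observer's facing direction, projected onto the ground (XY) plane.

    Hypothetical helper: illustrates the reference-frame idea only;
    it is not the ViewSpatial-Bench annotation pipeline itself.
    """
    # Vector from the observer to the target, ignoring height (Z).
    to_target = np.asarray(target_pos[:2], float) - np.asarray(observer_pos[:2], float)
    facing = np.asarray(observer_facing[:2], float)
    facing /= np.linalg.norm(facing)

    # Signed angle between the facing direction and the target direction.
    angle = np.degrees(
        np.arctan2(to_target[1], to_target[0])
        - np.arctan2(facing[1], facing[0])
    )
    angle = (angle + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)

    if -45.0 <= angle <= 45.0:
        return "front"
    if angle > 135.0 or angle < -135.0:
        return "behind"
    return "left" if angle > 0 else "right"

# The same scene yields different labels under different reference frames:
camera_pos, camera_facing = (0.0, 0.0, 1.5), (0.0, 1.0, 0.0)
person_pos, person_facing = (2.0, 3.0, 0.0), (0.0, -1.0, 0.0)  # person faces the camera
cup = (0.5, 2.5, 0.8)

print(direction_label(camera_pos, camera_facing, cup))  # egocentric label: 'front'
print(direction_label(person_pos, person_facing, cup))  # allocentric label: 'right'
```

A VLM answering from the camera's frame would say "front" here, while the correct answer from the person's perspective is "right"; the benchmark's human-viewpoint tasks require exactly this kind of frame transformation.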

Takeaways, Limitations

  • Current VLMs are strong at egocentric (camera-perspective) spatial reasoning, but fail to generalize to allocentric viewpoints.
  • ViewSpatial-Bench is the first comprehensive benchmark for evaluating multi-viewpoint spatial localization.
  • Modeling 3D spatial relationships enhances the spatial understanding capabilities of VLMs.
  • Fine-tuning VLMs on multi-perspective spatial data improves performance across all tasks.
  • This study provides an important benchmark for spatial intelligence.