Although vision-language models (VLMs) have demonstrated the ability to understand and reason about visual content, they struggle with tasks that require cross-view understanding and spatial reasoning. Current VLMs excel primarily at egocentric spatial reasoning from the camera's perspective, but fail to generalize to allocentric viewpoints, i.e., when they must adopt another entity's spatial frame of reference. ViewSpatial-Bench is the first comprehensive benchmark designed to evaluate multi-view spatial localization recognition; it covers five task types and is supported by an automatic 3D annotation pipeline that generates accurate orientation labels. A comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals significant performance gaps: the models perform reasonably well on camera-perspective tasks, but their accuracy degrades when reasoning from a human viewpoint. Fine-tuning VLMs on a multi-view spatial dataset yields a 46.24% overall performance improvement across tasks, providing evidence that modeling 3D spatial relationships enhances the spatial understanding capabilities of VLMs.
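The paper does not spell out its annotation pipeline here, but the core idea of deriving orientation labels from 3D geometry can be illustrated with a minimal sketch. The Python snippet below is an assumption for illustration only, not the paper's actual pipeline: the function name `allocentric_label`, the 90-degree front/back/left/right sectors, and the z-up coordinate convention are all hypothetical choices. It shows how a directional label for a target object might be computed relative to an in-scene observer's facing direction rather than the camera's.

```python
import numpy as np

def allocentric_label(observer_pos, observer_facing, target_pos):
    """Return 'front', 'back', 'left', or 'right' for a target relative
    to an observer's own frame of reference (hypothetical sketch).

    All inputs are 3D points/vectors in the same world coordinates
    (z-up assumed); only the horizontal plane is used for the relation.
    """
    to_target = np.asarray(target_pos, dtype=float) - np.asarray(observer_pos, dtype=float)
    facing = np.asarray(observer_facing, dtype=float)

    # Project onto the horizontal plane and normalize.
    to_target[2] = 0.0
    facing[2] = 0.0
    facing /= np.linalg.norm(facing)
    to_target /= np.linalg.norm(to_target)

    # Signed angle from the facing direction to the target direction;
    # positive angles are to the observer's left (counterclockwise, z-up).
    cos_a = float(np.dot(facing, to_target))
    sin_a = float(facing[0] * to_target[1] - facing[1] * to_target[0])
    angle = np.degrees(np.arctan2(sin_a, cos_a))

    if -45 <= angle <= 45:
        return "front"
    if 45 < angle <= 135:
        return "left"
    if -135 <= angle < -45:
        return "right"
    return "back"


# Example: an object placed to the right of a person facing along +y.
print(allocentric_label(observer_pos=[0, 0, 0],
                        observer_facing=[0, 1, 0],
                        target_pos=[1, 0.2, 0]))  # -> "right"
```

Note that the same geometry yields a different label when the reference frame is the camera instead of the person, which is exactly the egocentric-versus-allocentric gap the benchmark probes.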
Takeaways, Limitations
• Current VLMs are strong at egocentric (camera-perspective) spatial reasoning but fail to generalize to allocentric (other-centric) viewpoints.
• ViewSpatial-Bench is the first comprehensive benchmark for evaluating multi-view spatial localization recognition.
• Modeling 3D spatial relationships enhances the spatial understanding capabilities of VLMs.
• Fine-tuning VLMs on multi-view spatial data can improve overall performance.
• This study provides an important benchmark for spatial intelligence.