Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

Created by
  • Haebom

Author

Qi Feng

Outline

This paper presents ViCA2, a novel multimodal large language model (MLLM) designed to strengthen visuospatial cognition, i.e., the ability to reason about spatial layouts, relationships, and dynamics. ViCA2 features a dual visual encoder architecture that integrates SigLIP for semantics and Hiera for spatial structure, together with a token rate control mechanism for efficiency. The authors also built ViCA-322K, a large-scale dataset of over 320,000 spatial question-answer pairs, for targeted tuning. The ViCA2-7B model achieves a state-of-the-art average score of 56.8 on the VSI-Bench benchmark, outperforming much larger open-source and proprietary models such as LLaVA-NeXT-Video-72B and Gemini-1.5 Pro. The model, codebase, and ViCA-322K dataset are publicly released to support further research.
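To make the dual-encoder idea concrete, the sketch below shows one plausible way to fuse a semantic token stream (SigLIP-like) with a spatial token stream (Hiera-like) before feeding an LLM, with a simple ratio knob that limits the spatial-token budget. This is a minimal illustration, not the authors' implementation; the module name, feature dimensions, pooling choice, and the 0.25 ratio are all assumptions for demonstration.

```python
# Minimal sketch (not the authors' code): fusing tokens from a semantic
# encoder (e.g., SigLIP) and a spatial encoder (e.g., Hiera) before an LLM.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Hypothetical fusion module: projects both token streams into the
    LLM embedding space and limits the spatial-token budget by pooling."""
    def __init__(self, sem_dim=1152, spa_dim=768, llm_dim=4096, spatial_ratio=0.25):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, llm_dim)   # SigLIP-like features
        self.spa_proj = nn.Linear(spa_dim, llm_dim)   # Hiera-like features
        self.spatial_ratio = spatial_ratio            # token-rate control knob (illustrative)

    def forward(self, sem_tokens, spa_tokens):
        # sem_tokens: (B, N_sem, sem_dim), spa_tokens: (B, N_spa, spa_dim)
        sem = self.sem_proj(sem_tokens)
        spa = self.spa_proj(spa_tokens)
        # Keep only a fraction of spatial tokens relative to semantic tokens,
        # trading fine-grained structure for sequence-length efficiency.
        keep = max(1, int(sem.size(1) * self.spatial_ratio))
        spa = nn.functional.adaptive_avg_pool1d(spa.transpose(1, 2), keep).transpose(1, 2)
        # Concatenate both expert streams as the visual prefix for the LLM.
        return torch.cat([sem, spa], dim=1)

# Example: 729 semantic patch tokens and 1024 spatial tokens for one frame.
fusion = DualEncoderFusion()
visual_prefix = fusion(torch.randn(1, 729, 1152), torch.randn(1, 1024, 768))
print(visual_prefix.shape)  # torch.Size([1, 911, 4096]) = 729 semantic + 182 spatial tokens
```

The key design point this illustrates is that the spatial expert contributes structure without dominating the context window: its token count is tied to the semantic stream by a fixed ratio rather than growing with the raw encoder output.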

Takeaways, Limitations

Takeaways:
  • Despite its small size (7B), the model achieves visuospatial reasoning performance that surpasses existing large-scale models.
  • The effectiveness of the dual visual encoder architecture is demonstrated together with the new ViCA-322K dataset.
  • Open access to the model, code, and dataset can facilitate further research.
Limitations:
  • Performance was evaluated only on the VSI-Bench benchmark.
  • Further evaluation of the model's generalization ability is needed.
  • The bias and generalizability of the ViCA-322K dataset have not been analyzed.