Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Visuospatial Cognitive Assistant

Created by
  • Haebom

Author

Qi Feng

Outline

This paper addresses video-based spatial cognition, essential for robotics and embodied AI, as a key challenge for current Vision-Language Models (VLMs). We present ViCA-322K, a diverse dataset of 322,003 question-answer pairs derived from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), covering both direct queries grounded in 3D metadata and complex video-based reasoning. We also develop ViCA-7B, fine-tuned on ViCA-322K, and demonstrate that it achieves state-of-the-art performance on all eight VSI-Bench tasks, outperforming larger models (e.g., +26.1 on absolute distance). To enhance interpretability, we present the ViCA-Thinking-2.68K dataset, which includes explicit reasoning chains, and further fine-tune ViCA-7B to produce ViCA-7B-Thinking, a model that articulates its spatial reasoning step by step. This study highlights the importance of targeted data, suggests directions for improved spatiotemporal modeling, and fosters robust visuospatial intelligence research by releasing all resources.
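To make the dataset description concrete, here is a minimal sketch of what a ViCA-322K-style question-answer record and instruction prompt might look like. The field names, paths, and values below are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical ViCA-322K-style QA record (illustrative schema, not the real one).
qa_pair = {
    "video": "scannet/scene0000_00.mp4",  # source indoor video (assumed path)
    "question": "What is the absolute distance between the sofa and the table?",
    "answer": "1.8 meters",               # answer grounded in 3D scene metadata
    "task": "absolute_distance",          # a VSI-Bench-style task category
}

def format_prompt(record: dict) -> str:
    """Render a QA record into a simple instruction-tuning prompt."""
    return f"Video: {record['video']}\nQ: {record['question']}\nA:"

print(format_prompt(qa_pair))
```

A fine-tuning pipeline would pair such prompts with the reference answers as supervision targets; the ViCA-Thinking variant would additionally interleave an explicit reasoning chain before the final answer.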

Takeaways, Limitations

Takeaways:
Contributes ViCA-322K, a large-scale, diverse question-answering dataset built from real-world indoor videos, for training video-based spatial reasoning models.
Achieves state-of-the-art performance across all eight VSI-Bench tasks with the ViCA-7B model.
Improves model interpretability with the ViCA-Thinking-2.68K dataset and the ViCA-7B-Thinking model, which make the reasoning process explicit.
Emphasizes the importance of targeted data and suggests directions for improved spatiotemporal modeling.
Encourages follow-up research by releasing all research resources.
Limitations:
The paper does not explicitly discuss its limitations; additional experiments and analysis are needed to assess the dataset's limitations, the model's generalization performance, and its vulnerability to specific types of spatial reasoning tasks.
The ViCA-Thinking-2.68K dataset is relatively small, so the generalizability of the learned reasoning process requires further verification.