This paper addresses the challenge of video-based spatial cognition, a capability essential for robotics and embodied AI that remains difficult for current Vision-Language Models (VLMs). We present ViCA-322K, a diverse dataset of 322,003 question-answer pairs derived from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), covering both direct queries grounded in 3D metadata and complex video-based spatial reasoning. We then develop ViCA-7B, a model fine-tuned on ViCA-322K, and show that it achieves state-of-the-art performance on all eight VSI-Bench tasks, outperforming substantially larger models (e.g., +26.1 on the absolute distance task). To improve interpretability, we further introduce the ViCA-Thinking-2.68K dataset, which contains explicit reasoning chains, and fine-tune ViCA-7B on it to obtain ViCA-7B-Thinking, a model that makes its spatial reasoning explicit. This work highlights the importance of task-targeted data, outlines directions for improved spatiotemporal modeling, and promotes research on robust visuospatial intelligence by making all resources publicly available.