This paper presents ViCA2, a novel multimodal large language model (MLLM) designed to enhance visual-spatial cognition, i.e., the ability to infer spatial layout, relations, and dynamics. ViCA2 features a dual vision encoder architecture that integrates SigLIP for semantics and Hiera for spatial structure, together with a token ratio control mechanism for efficiency. We further construct ViCA-322K, a large-scale dataset of over 320,000 spatially grounded question-answer pairs, for targeted instruction tuning. On the VSI-Bench benchmark, the ViCA2-7B model achieves a state-of-the-art average score of 56.8, surpassing both larger open-source models such as LLaVA-NeXT-Video-72B and proprietary models such as Gemini-1.5 Pro. We release ViCA2, its codebase, and the ViCA-322K dataset to support further research.
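To make the dual-encoder idea concrete, the minimal PyTorch sketch below shows one way two visual token streams (a semantic stream, e.g., from SigLIP, and a spatial stream, e.g., from Hiera) could be projected into a shared embedding space and combined under a fixed token budget governed by a ratio hyperparameter. This is an illustrative assumption, not the authors' implementation: the class name, feature dimensions, pooling strategy, and ratio value are all hypothetical.

```python
# Hypothetical sketch (not ViCA2's actual code): fusing semantic and spatial
# visual tokens under a token budget controlled by a ratio hyperparameter.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoderFusion(nn.Module):
    """Illustrative fusion of two visual token streams with a token-ratio budget."""

    def __init__(self, sem_dim: int, spa_dim: int, hidden_dim: int,
                 total_tokens: int = 576, spatial_ratio: float = 0.25):
        super().__init__()
        # Project both token streams into the LLM's embedding space.
        self.sem_proj = nn.Linear(sem_dim, hidden_dim)
        self.spa_proj = nn.Linear(spa_dim, hidden_dim)
        # Allocate the visual token budget between the two streams.
        self.n_spa = int(total_tokens * spatial_ratio)
        self.n_sem = total_tokens - self.n_spa

    @staticmethod
    def _pool_to_length(tokens: torch.Tensor, n_out: int) -> torch.Tensor:
        # Downsample a (B, N, D) token sequence to n_out tokens via adaptive pooling.
        pooled = F.adaptive_avg_pool1d(tokens.transpose(1, 2), n_out)
        return pooled.transpose(1, 2)

    def forward(self, sem_tokens: torch.Tensor, spa_tokens: torch.Tensor) -> torch.Tensor:
        sem = self._pool_to_length(self.sem_proj(sem_tokens), self.n_sem)
        spa = self._pool_to_length(self.spa_proj(spa_tokens), self.n_spa)
        # Concatenate the two streams into a single visual prefix for the LLM.
        return torch.cat([sem, spa], dim=1)


if __name__ == "__main__":
    fusion = DualEncoderFusion(sem_dim=1152, spa_dim=768, hidden_dim=4096)
    sem = torch.randn(1, 729, 1152)   # e.g., SigLIP patch tokens (assumed shape)
    spa = torch.randn(1, 1024, 768)   # e.g., Hiera feature tokens (assumed shape)
    print(fusion(sem, spa).shape)     # torch.Size([1, 576, 4096])
```

The intent of such a ratio control is to cap the number of visual tokens passed to the language model while still reserving a fixed share of the budget for spatial-structure features; the specific pooling and split shown here are only one possible realization.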