This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
Created by
Haebom
Author
Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu
Outline
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) using Chain-of-Thought (CoT) prompting and reinforcement learning. We find that simple CoT formulations do not improve performance and can even degrade it, whereas multi-stage structured prompting grounded in scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. We fine-tune models on the SAT dataset using Group Relative Policy Optimization (GRPO) and evaluate them on CVBench. Compared with supervised fine-tuning (SFT), GRPO achieves higher Pass@1 accuracy and greater robustness under out-of-distribution (OOD) conditions. In particular, SFT overfits surface-level linguistic patterns and can degrade when the phrasing changes at test time (e.g., from "closer to" to "farther from"), whereas GRPO generalizes more reliably and maintains stable performance under such changes. Our results provide insight into how reinforcement learning and structured prompting can improve the spatial reasoning and generalization of state-of-the-art VLMs. All code is publicly available at https://github.com/Yvonne511/spatial-vlm-investigator.
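To make the SceneGraph CoT idea concrete, below is a minimal sketch of what a multi-stage, scene-graph-grounded prompt could look like. The template wording, the stage breakdown, and the `build_scenegraph_cot_prompt` helper are illustrative assumptions, not the authors' actual prompt.

```python
# Minimal sketch of a multi-stage SceneGraph CoT prompt.
# The template text below is hypothetical; the paper's actual
# prompt wording and stage structure may differ.

SCENEGRAPH_COT_TEMPLATE = """You are answering a spatial reasoning question about an image.

Step 1: List the objects visible in the image.
Step 2: Build a scene graph: for each pair of related objects, write a
(subject, relation, object) triple, e.g. (cup, left-of, laptop).
Step 3: Using only the scene graph, reason step by step about the question.
Step 4: State the final answer on its own line as "Answer: <option>".

Question: {question}
Options: {options}
"""

def build_scenegraph_cot_prompt(question: str, options: list[str]) -> str:
    """Fill the multi-stage template with a specific question and its options."""
    return SCENEGRAPH_COT_TEMPLATE.format(
        question=question,
        options=", ".join(options),
    )

if __name__ == "__main__":
    prompt = build_scenegraph_cot_prompt(
        "Which object is closer to the camera?",
        ["the cup", "the laptop"],
    )
    print(prompt)
```

The design point of this style of prompt is that the model must commit to explicit (subject, relation, object) triples before answering, so the final answer is grounded in a structured intermediate representation rather than in free-form text.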
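On the GRPO side, the defining ingredient is the group-relative baseline: several responses are sampled per prompt, and each response's advantage is its reward normalized by the group's mean and standard deviation, with no learned value network. Below is a minimal sketch of that advantage computation; the exact-match reward and all function names are hypothetical stand-ins for the paper's answer-correctness reward.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its group's mean and std (the GRPO baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def exact_match_reward(response: str, gold: str) -> float:
    """Toy reward: 1.0 if the model's final answer matches the gold answer."""
    return 1.0 if response.strip().lower() == gold.strip().lower() else 0.0

if __name__ == "__main__":
    # One prompt, a group of G = 4 sampled final answers.
    gold = "the cup"
    group = ["the cup", "the laptop", "the cup", "cup"]
    rewards = [exact_match_reward(r, gold) for r in group]
    for resp, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{resp!r}: advantage = {adv:+.3f}")
```

Because the reward scores only the final answer rather than rewarding token-by-token imitation of reference text, GRPO has less incentive to latch onto surface phrasings, which may help explain the OOD robustness reported above.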