
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

Created by
  • Haebom

Authors

Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu

Outline

This study investigates the spatial reasoning capabilities of vision-language models (VLMs) using Chain-of-Thought (CoT) prompting and reinforcement learning. We find that simple CoT formats do not improve performance and can even degrade it, whereas multi-stage structured prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. We fine-tune the model on the SAT dataset using Group Relative Policy Optimization (GRPO) and evaluate it on CVBench. Compared with supervised fine-tuning (SFT), GRPO achieves higher Pass@1 accuracy and is more robust under out-of-distribution (OOD) conditions. In particular, SFT overfits to surface-level linguistic patterns, so its performance can degrade when the phrasing changes at test time (e.g., from “closer to” to “farther from”), whereas GRPO generalizes more reliably and remains stable under such changes. Our results provide insight into how reinforcement learning and structured prompting can improve the spatial reasoning and generalization of state-of-the-art VLMs. All code is publicly available at https://github.com/Yvonne511/spatial-vlm-investigator.
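To make the "group relative" part of GRPO more concrete, here is a minimal sketch of the advantage computation commonly associated with it: several answers are sampled for the same question, and each answer's reward is normalized against the mean and standard deviation of its own group. This is not taken from the authors' repository. The function name, the `eps` constant, the use of the population standard deviation, and the 0/1 correctness reward are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    `rewards` holds one scalar reward per sampled answer in a group.
    `eps` (an assumed constant) avoids division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one spatial question,
# rewarded 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all answers get the same reward yields zero advantages,
# so it contributes no learning signal in this formulation.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, relative to the group rather than to a learned value function.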

Takeaways, Limitations

Takeaways:
SceneGraph CoT prompting improves the spatial reasoning performance of VLMs, whereas simple CoT formats do not.
GRPO-based reinforcement learning achieves higher Pass@1 accuracy and better OOD robustness than SFT.
SFT overfits to surface-level linguistic patterns, while GRPO generalizes more reliably when the phrasing changes at test time.
Structured prompting and reinforcement learning are shown to be effective ways to improve the spatial reasoning ability of VLMs.
Limitations:
Further research is needed on whether the findings generalize beyond the datasets and model used in the study.
Performance evaluation of GRPO on other types of spatial reasoning problems is needed.
Further analysis of the computational cost and efficiency of GRPO is needed.
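The Pass@1 and OOD comparison described in the outline can be sketched as follows: score one sampled answer per question by exact match, once on questions phrased as in training and once on rephrased questions, then compare the two accuracies. The function name, the matching rule, and the toy answer lists are hypothetical and are not results from the paper.

```python
def pass_at_1(predictions, answers):
    """Fraction of questions whose single sampled answer matches the reference.

    Matching here is a case-insensitive exact match after trimming
    whitespace; this rule is an assumption of the sketch.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data, for illustration only.
# In-distribution split: questions phrased as in training ("closer to").
id_score = pass_at_1(["A", "B", "A", "B"], ["A", "B", "B", "B"])
# OOD split: the same kind of question rephrased ("farther from").
ood_score = pass_at_1(["A", "A", "A", "B"], ["B", "A", "B", "B"])

# A large positive gap suggests sensitivity to surface phrasing.
robustness_gap = id_score - ood_score
```

A model that relies on surface phrasing would show a large gap between the two scores, while a model that generalizes would keep the gap small.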