Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Created by
  • Haebom

Authors

Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

Outline

This paper studies Euclidean geometry problem solving as a surrogate task for spatial intelligence in multimodal large language models (MLLMs). Spatial intelligence here covers abilities such as visual shape transformation, object rotation, relative position judgment, and numerical estimation. The authors construct Euclid30K, a multimodal dataset of approximately 30,000 plane and solid geometry problems, and fine-tune the Qwen2.5VL and RoboBrain2.0 models on it using Group Relative Policy Optimization (GRPO). After training on Euclid30K, with no task-specific adaptation, the models show zero-shot improvements on four spatial reasoning benchmarks: Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube. In particular, the mean VSI-Bench accuracy across all evaluated models rose from 34.5% to 40.5% (reported as a gain of 5.5 percentage points), and RoboBrain2.0-Euclid-7B reached 49.6% accuracy, surpassing the previous best-performing model, Spatial-MLLM. According to the authors, this is the first systematic study showing that geometry-focused fine-tuning can give vision-language models broadly transferable spatial skills.
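For readers unfamiliar with GRPO, its core idea is to score each sampled answer relative to the other answers drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that group-relative normalization step. It is not the authors' training code: the function name, the binary correctness reward, the group size of four, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same prompt.

    Hypothetical helper: subtracts the group mean and divides by the group
    standard deviation (population form assumed here), with eps guarding
    against division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed rewards for four sampled answers to one geometry problem:
# 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers get a positive advantage and incorrect ones a negative
# advantage, so the policy update favors the better samples in the group.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In a full GRPO setup these advantages would weight a clipped policy-gradient objective, usually with a KL penalty toward a reference model. Those parts are omitted here because the summary does not describe the authors' configuration.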

Takeaways, Limitations

Takeaways:
Fine-tuning on geometry problems improves the spatial reasoning capabilities of MLLMs.
The Euclid30K dataset combined with GRPO proves effective as a training approach.
Zero-shot performance improves across four spatial reasoning benchmarks.
RoboBrain2.0-Euclid-7B surpasses the previous best-performing model, Spatial-MLLM, on VSI-Bench.
Geometric surrogate tasks offer a new approach to spatial intelligence research.
Limitations:
Little information is given about the Euclid30K dataset and the resources used to train the models.
The models' generalization ability needs further validation.
Applicability and performance on other spatial reasoning tasks remain to be verified.
An in-depth analysis of the models' reasoning process is lacking.