Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints

Created by
  • Haebom

Authors

Sunil Kumar, Bowen Zhao, Leo Dirac, Paulina Varshavskaya

Outline

This paper presents a method for improving the detailed visual reasoning of vision-language models (VLMs) under tight computational constraints. Inspired by DeepSeek-R1, the authors train small models with Group Relative Policy Optimization (GRPO) and have them call external tools such as zoom. The largest gains come from combining GRPO training, a simple reward structure, a streamlined tool-call interface, an additional token allocation for tool-call results, and a training-data mix that overrepresents visually challenging examples. As a result, the models outperform similarly sized baselines on some visual question answering (VQA) tasks, thanks to the detailed visual information gathered through the external tools.
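The page does not include code, but two of the ingredients named above, the streamlined tool-call interface and the simple reward structure, can be illustrated with a minimal sketch. Everything concrete here (the zoom signature, the <tool> tag format, the 0.1 tool-use bonus) is an assumption for illustration, not a detail taken from the paper.

```python
import re
from PIL import Image

# Illustrative sketch only: tag format, zoom signature, and reward
# weights are assumptions, not the paper's actual implementation.

def zoom(image: Image.Image, x0: int, y0: int, x1: int, y1: int,
         out_size: int = 448) -> Image.Image:
    """Streamlined tool interface: crop the box (x0, y0, x1, y1) and
    upsample it so the model can inspect fine-grained detail. The
    enlarged crop is fed back to the model, consuming the extra
    tokens allocated for tool-call results."""
    return image.crop((x0, y0, x1, y1)).resize((out_size, out_size))

# Simple reward structure: binary answer correctness plus a small
# bonus for emitting a well-formed tool call.
TOOL_CALL = re.compile(r"<tool>zoom\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</tool>")

def reward(response: str, gold_answer: str) -> float:
    correct = 1.0 if gold_answer.strip().lower() in response.lower() else 0.0
    well_formed_call = 0.1 if TOOL_CALL.search(response) else 0.0
    return correct + well_formed_call
```

Keeping the interface to a single tool with a fixed signature keeps the action space small, which plausibly matters when the policy model itself is small.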

Takeaways, Limitations

Takeaways:
Shows that the detailed visual reasoning of VLMs can be improved even under limited computational resources.
Presents an effective training recipe that combines GRPO with external tool use (see the GRPO sketch at the end of this summary).
Demonstrates the value of training data that overrepresents visually challenging examples.
Improves VQA performance by gathering detailed visual information through external tools.
Limitations:
Performance gains are reported only on specific VQA tasks, so generalization to broader VLM capabilities may be limited.
The only external tool used is zoom; integrating a wider variety of tools remains to be studied.
The method's effectiveness may depend on the particular datasets and settings.
Further research is needed on generalization to other VLM architectures and to more complex visual reasoning tasks.
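For readers unfamiliar with GRPO: its core idea, as introduced alongside DeepSeek-R1, is to drop PPO's learned value function and instead normalize each rollout's reward against a group of rollouts sampled from the same prompt. A minimal sketch of that advantage computation, assuming standard GRPO rather than any variant specific to this paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps),
    computed within each group of rollouts sampled from the same prompt.
    `rewards` has shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 rollouts each, scored by a reward function like
# the one sketched in the Outline above.
rewards = torch.tensor([[1.1, 0.0, 1.0, 0.1],
                        [0.0, 0.0, 1.1, 0.0]])
print(grpo_advantages(rewards))
```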