Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Created by
  • Haebom

Authors

Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Outline

This paper proposes VisionThink, a method that improves the efficiency of vision-language models (VLMs) by reducing the number of image tokens. Existing VLMs encode images into token sequences far longer than the accompanying text, yet most practical tasks do not need that many tokens. VisionThink first downsamples the input image and decides whether the low-resolution version suffices to solve the problem; if not, it emits a special token requesting the high-resolution image. Trained with reinforcement learning and an LLM-as-Judge strategy, it applies to general VQA tasks, and a reward function with a penalty mechanism yields stable and reasonable image-resizing ratios. It retains fine-grained visual understanding on OCR-related tasks while sharply reducing the number of image tokens on simpler ones.
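To make the two-stage mechanism concrete, here is a minimal Python sketch of the inference loop and a toy reward, assuming a hypothetical `vlm.generate(image, question)` interface. The `REQUEST_HIGH_RES` token string and the penalty value are illustrative placeholders, not the authors' actual implementation.

```python
from PIL import Image

# Hypothetical special token; the real token string is defined by the model.
REQUEST_HIGH_RES = "<request_high_res>"

def visionthink_infer(vlm, image: Image.Image, question: str,
                      downsample: float = 0.5) -> str:
    """Answer with a downsampled image first; escalate to full
    resolution only if the model asks for it."""
    # Cheap first pass: the downsampled image produces far fewer image tokens.
    small = image.resize((int(image.width * downsample),
                          int(image.height * downsample)))
    answer = vlm.generate(small, question)

    # If the model emits the special token, it judged the low-resolution
    # view insufficient, so rerun once at full resolution.
    if REQUEST_HIGH_RES in answer:
        answer = vlm.generate(image, question)
    return answer

def rl_reward(judge_score: float, used_high_res: bool,
              penalty: float = 0.1) -> float:
    """Toy reward: an LLM-as-Judge correctness score minus a penalty
    for requesting the high-resolution image, so the policy learns to
    escalate only when the extra tokens are actually needed."""
    return judge_score - (penalty if used_high_res else 0.0)
```

The penalty term is what keeps the policy from always requesting the full-resolution image, matching the paper's use of a reward function and penalty mechanism to reach stable resizing ratios.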

Takeaways, Limitations

Takeaways:
Demonstrates that VLM efficiency can be significantly improved by dynamically adjusting the number of image tokens.
Presents a more effective and flexible approach than existing fixed-ratio token compression methods.
Successfully applies reinforcement learning and an LLM-as-Judge strategy to general VQA tasks.
Performs well on OCR-related tasks and effectively reduces token counts on simple tasks.
Reproducibility is supported by publicly released code.
Limitations:
Further research may be needed on the generalization performance of the proposed method.
Performance may degrade on certain types of tasks (e.g., some OCR-related tasks).
The description of the reinforcement learning training process may lack detail.