Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Created by
  • Haebom

Author

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang

Outline

In this paper, we study encoder-free vision-language models (VLMs), which are rapidly closing the performance gap with encoder-based VLMs. We systematically compare models built with pre-trained vision encoders, discrete tokenizers, and minimalist visual layers trained from scratch, and examine in depth the characteristics of encoder-free VLMs. Based on this analysis, we develop an efficient strategy for training encoder-free VLMs that rival encoder-based ones and present an improved encoder-free model family, EVEv2.0. EVEv2.0 properly decomposes and hierarchically associates visual and linguistic information within a unified architecture, reducing interference between modalities, and adopts training strategies designed for effective optimization. Experimental results show that EVEv2.0 achieves strong data efficiency and powerful visual reasoning capability.
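To make the "decompose and hierarchically associate" idea more concrete, below is a minimal sketch of a decoder block that keeps modality-specific normalization and feed-forward parameters while sharing self-attention over the full multimodal token sequence. This is an illustrative assumption, not the authors' released implementation; names such as ModalitySplitBlock and the routing helper are hypothetical, and the real EVEv2.0 design decomposes additional components as well.

```python
# Illustrative sketch (not the authors' code): a decoder block with
# modality-specific LayerNorm/FFN parameters and shared attention,
# approximating the "decompose, then associate in one model" idea.
import torch
import torch.nn as nn

class ModalitySplitBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        # Shared self-attention over the full multimodal sequence.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Modality-specific parameters: index 0 = text, 1 = vision.
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(2)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)
        ])

    def _route(self, x, modality_ids, modules):
        # Apply each token's modality-specific module, then merge results.
        out = torch.zeros_like(x)
        for m, module in enumerate(modules):
            mask = (modality_ids == m).unsqueeze(-1)  # (B, T, 1)
            out = torch.where(mask, module(x), out)
        return out

    def forward(self, x, modality_ids):
        # x: (B, T, dim); modality_ids: (B, T) with 0 = text, 1 = vision.
        h = self._route(x, modality_ids, self.norm1)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self._route(x, modality_ids, self.norm2)
        x = x + self._route(h, modality_ids, self.ffn)
        return x

if __name__ == "__main__":
    block = ModalitySplitBlock(dim=64, n_heads=4)
    tokens = torch.randn(2, 10, 64)
    ids = torch.cat([torch.zeros(2, 6, dtype=torch.long),
                     torch.ones(2, 4, dtype=torch.long)], dim=1)
    print(block(tokens, ids).shape)  # torch.Size([2, 10, 64])
```

The design intent this sketch tries to capture is that vision and text tokens flow through one decoder (so they can attend to each other), while each modality keeps its own normalization and feed-forward weights so that optimizing one modality interferes less with the other.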

Takeaways, Limitations

Takeaways:
Presents an efficient strategy for improving the performance of encoder-free VLMs
Proposes a model structure and training strategy that reduce inter-modal interference
Develops EVEv2.0, a model with strong data efficiency and visual reasoning ability
Helps narrow the performance gap with encoder-based models
Limitations:
The paper does not specifically discuss the limitations of the proposed EVEv2.0 model.
Further comparative analysis with other state-of-the-art encoder-free VLMs is needed.
Additional evaluation of generalization performance across a variety of vision-language tasks is needed.