Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity

Created by
  • Haebom

Authors

Zhibin Lan, Liqiang Niu, Fandong Meng, Wenbo Li, Jie Zhou, Jinsong Su

Outline

To address the large number of visual tokens generated when processing high-resolution images, this paper proposes AVG-LLaVA, a large multimodal model (LMM) that adaptively selects the visual granularity based on the input image and instruction. AVG-LLaVA produces visual tokens at multiple granularities through a set of pooling layers and selects an appropriate granularity with a visual granularity router composed of a Transformer layer, an MLP layer, and a voter layer. The paper also introduces RGLF, a novel training method that aligns the router's predictions with the LMM's preferences without requiring additional manual annotation. Experiments show that AVG-LLaVA achieves strong performance across 11 benchmarks while significantly reducing the number of visual tokens and speeding up inference (e.g., an 85.3% reduction in visual tokens and a 2.53x increase in inference speed on the AI2D benchmark).
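
The adaptive-granularity idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the module names, dimensions, and pooling strides are assumptions. Pooling layers derive coarser token sets from the vision encoder's output, and a small Transformer + MLP "voter" scores each candidate granularity given the instruction.

```python
# Minimal sketch (assumed PyTorch, illustrative sizes) of multi-granularity pooling
# plus a visual granularity router, as described in the summary above.
import torch
import torch.nn as nn


class GranularityRouter(nn.Module):
    """Scores each candidate granularity from its visual tokens and an
    instruction embedding; the highest-scoring granularity is kept."""

    def __init__(self, dim: int = 768, num_granularities: int = 4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.num_granularities = num_granularities

    def forward(self, granularity_feats: list[torch.Tensor], instruction: torch.Tensor) -> torch.Tensor:
        # granularity_feats: list of (B, N_i, dim) token sets, one per granularity
        # instruction: (B, dim) pooled instruction embedding
        scores = []
        for feats in granularity_feats:
            # Prepend the instruction token so the router sees image + instruction jointly.
            tokens = torch.cat([instruction.unsqueeze(1), feats], dim=1)
            encoded = self.encoder(tokens)
            # "Voter" step: every token casts a score; average them per granularity.
            scores.append(self.mlp(encoded).mean(dim=(1, 2)))
        return torch.stack(scores, dim=-1)  # (B, num_granularities) logits


class MultiGranularityPooling(nn.Module):
    """Builds coarser visual token sets by average-pooling the full-resolution grid."""

    def __init__(self, strides=(1, 2, 4, 8)):
        super().__init__()
        self.strides = strides

    def forward(self, tokens: torch.Tensor, grid: int) -> list[torch.Tensor]:
        # tokens: (B, grid*grid, dim) visual tokens from the vision encoder
        b, _, d = tokens.shape
        feats = tokens.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (B, dim, grid, grid)
        out = []
        for s in self.strides:
            pooled = nn.functional.avg_pool2d(feats, kernel_size=s) if s > 1 else feats
            out.append(pooled.flatten(2).transpose(1, 2))  # back to (B, N_s, dim)
        return out


if __name__ == "__main__":
    b, grid, dim = 2, 24, 768
    visual_tokens = torch.randn(b, grid * grid, dim)
    instruction = torch.randn(b, dim)
    candidates = MultiGranularityPooling()(visual_tokens, grid)
    logits = GranularityRouter(dim)(candidates, instruction)
    chosen = logits.argmax(dim=-1)  # index of the selected granularity per image
    print([c.shape[1] for c in candidates], chosen.tolist())
```

In this sketch, selecting a coarser granularity shrinks the token set fed to the LMM (e.g., stride 8 reduces 576 tokens to 9), which is the mechanism behind the reported token reduction and inference speedup.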

Takeaways, Limitations

Takeaways:
  • Presents a novel approach that effectively addresses the excessive number of visual tokens produced when processing high-resolution images.
  • Adaptively adjusts visual granularity based on the input image and instruction, improving both performance and efficiency.
  • Introduces the RGLF training method, which improves the model's ability to select the right visual granularity without additional manual annotation.
  • Demonstrates better performance and efficiency than existing models across a range of benchmarks.
Limitations:
  • Further research is needed on the generalization of the RGLF training method and its applicability to other LMMs.
  • Robust evaluation on diverse types of high-resolution images and complex instructions is still required.
  • The added complexity and computational cost of the visual granularity router need further analysis.