Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Created by
  • Haebom

Author

Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Outline

Multimodal Large Language Models (MLLMs) integrate multiple vision encoders to capture diverse visual signals, but in practice encoder redundancy often degrades performance. This study identifies such redundancy through encoder-masking experiments and quantifies each encoder's contribution and efficiency with two metrics: the Conditional Utilization Rate (CUR) and the Information Gap (IG). The results show that a single encoder dominates on certain specialized tasks, while encoders are largely interchangeable for general VQA and knowledge-based tasks. Moreover, masking a specific encoder can yield higher accuracy than the full model.
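The masking-based analysis can be sketched as follows. This is a minimal illustration, not the paper's exact definitions: here we assume CUR is the relative score drop when one encoder is masked, and IG is the gap between the full model and the best single encoder; the encoder names and scores are hypothetical.

```python
def conditional_utilization_rate(full_score: float, masked_score: float) -> float:
    """Relative drop in task score when a given encoder is masked out
    (assumed definition for illustration)."""
    return (full_score - masked_score) / full_score

def information_gap(full_score: float, single_encoder_scores: list[float]) -> float:
    """Gap between the full multi-encoder model and the best single-encoder
    configuration (assumed definition for illustration)."""
    return full_score - max(single_encoder_scores)

# Hypothetical scores (e.g., VQA accuracy) for a 3-encoder MLLM.
full = 0.80
masked = {"clip": 0.78, "dino": 0.62, "sam": 0.81}   # score with that encoder masked
single = [0.75, 0.55, 0.40]                           # score using only that encoder

cur = {name: conditional_utilization_rate(full, s) for name, s in masked.items()}
ig = information_gap(full, single)
# A low or negative CUR ("sam" here) flags a redundant encoder: masking it
# leaves the score unchanged or even higher, matching the study's observation
# that removing a specific encoder can beat the full model.
```

In this sketch, a large CUR marks an encoder the model genuinely relies on, while a small IG indicates the extra encoders add little over the best single one.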

Takeaways, Limitations

Takeaways:
  • We raise the issue of efficiency in MLLMs that use multiple vision encoders, showing that more encoders do not always guarantee better performance.
  • We present a method to quantitatively analyze encoder contribution and redundancy using the CUR and IG metrics.
  • We demonstrate that masking specific encoders can improve model performance.
  • We provide diagnostic information for efficient architecture design during MLLM development.
Limitations:
  • Generalizability may be limited, since the experiments cover only specific models and tasks.
  • No concrete suggestions for improved architecture design are presented.
  • Further research is needed on inter-encoder interaction and dynamic encoder utilization.