Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models

Created by
  • Haebom

Author

Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang

Outline

This paper systematically analyzes the phenomenon of "text dominance" in multimodal large language models (MLLMs) that process diverse modalities (images, video, audio, time series, and graphs). Text dominance refers to MLLMs over-relying on text input without fully utilizing the other modalities. The study introduces two evaluation metrics, the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI), and uses them to show that text dominance is widespread across modalities. The identified causes include attention dilution from token redundancy in non-text modalities, the design of the fusion architecture, and task formulations that favor text input. The authors further demonstrate that a simple token compression method can effectively correct the attentional imbalance (e.g., reducing the MDI of LLaVA-7B from 10.23 to 0.86). The study provides a foundation for developing more balanced and comprehensive multimodal language models.
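To make the two ideas concrete, below is a minimal PyTorch sketch of (a) an MDI-style ratio computed from attention weights and (b) a naive token compression step. The function names, the aggregation over heads and queries, and the pooling scheme are illustrative assumptions; the paper's actual definitions of MDI/AEI and its compression method may differ.

```python
import torch

def modality_dominance_index(attn, text_mask, image_mask):
    """Illustrative MDI-style ratio (an assumption, not the paper's exact
    formula): mean attention mass received per text token divided by mean
    attention mass received per image token. Values >> 1 indicate that
    text dominates the model's attention.

    attn:       (num_heads, seq_len, seq_len) attention weights of one layer
    text_mask:  (seq_len,) bool, True at text-token positions
    image_mask: (seq_len,) bool, True at image-token positions
    """
    # Average over heads, then over query positions, giving the
    # attention mass each key position receives.
    received = attn.mean(dim=0).mean(dim=0)       # (seq_len,)
    text_share = received[text_mask].mean()       # per-text-token mass
    image_share = received[image_mask].mean()     # per-image-token mass
    return (text_share / image_share).item()

def compress_image_tokens(image_tokens, ratio=4):
    """Toy token compression (an assumption): average-pool groups of
    `ratio` adjacent visual tokens to cut redundancy, so the surviving
    tokens carry a denser signal and attract a larger share of attention.

    image_tokens: (num_tokens, hidden_dim) visual token embeddings
    """
    n, d = image_tokens.shape
    n_keep = n // ratio
    return image_tokens[: n_keep * ratio].reshape(n_keep, ratio, d).mean(dim=1)
```

Under this reading, the reported drop in LLaVA-7B's MDI (10.23 → 0.86) corresponds to the per-token attention on visual tokens rising to roughly parity with text once redundant tokens are compressed away.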

Takeaways, Limitations

Takeaways:
First study to systematically reveal the severity and pervasiveness of text dominance in multimodal large language models.
Analyzes the causes of text dominance from multiple perspectives and proposes remedies.
The proposed evaluation metrics (MDI, AEI) and the token compression method can support future multimodal model development and evaluation.
Marks an important step toward more balanced and comprehensive multimodal language models.
Limitations:
Further research is needed on the generality of the proposed token compression method and its applicability to other models and datasets.
The analysis of the causes of text dominance requires deeper follow-up study.
The coverage of different fusion architectures and task formulations may not be comprehensive.