This paper presents a systematic analysis of "text dominance" in multimodal large language models (MLLMs) that process images, video, audio, time series, and graphs: the tendency of MLLMs to over-rely on textual input while underutilizing the other modalities. We introduce two evaluation metrics, the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI), and use them to show that text dominance is pervasive across all of the modalities studied. We trace its causes to attention dilution from token redundancy in non-text modalities, fusion-architecture design choices, and task formulations that implicitly favor textual input. We further show that a simple token-compression method effectively rebalances attention, reducing the MDI of LLaVA-7B from 10.23 to 0.86. These findings provide a foundation for developing more balanced and comprehensive multimodal language models.
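Since the abstract does not reproduce the paper's exact formula, the following is a minimal sketch of how an MDI-style ratio might be computed from a model's attention weights; the function name, the per-token normalization, and the toy inputs are illustrative assumptions, not the paper's definition.

```python
# Hypothetical sketch of a Modality Dominance Index (MDI)-style metric:
# compare the average attention mass placed on text tokens against the mass
# placed on non-text (e.g., image) tokens, normalized per token so that sheer
# token count in one modality does not mechanically inflate its share.
import numpy as np

def modality_dominance_index(attn: np.ndarray, is_text: np.ndarray) -> float:
    """Per-token attention to text divided by per-token attention to non-text.

    attn    : (num_queries, num_keys) attention weights, each row sums to 1.
    is_text : (num_keys,) boolean mask, True where the key token is text.
    """
    text_mass = attn[:, is_text].sum(axis=1).mean()    # avg mass on text keys
    other_mass = attn[:, ~is_text].sum(axis=1).mean()  # avg mass on non-text keys
    text_per_token = text_mass / max(int(is_text.sum()), 1)
    other_per_token = other_mass / max(int((~is_text).sum()), 1)
    return text_per_token / other_per_token

# Toy example: 4 query tokens attending over 6 text and 10 image tokens.
rng = np.random.default_rng(0)
attn = rng.random((4, 16))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows to valid attention
is_text = np.array([True] * 6 + [False] * 10)
print(f"MDI ~ {modality_dominance_index(attn, is_text):.2f}")  # ~1.0 = balanced
```

Under this reading, an MDI near 1 would indicate balanced attention across modalities, while a value like 10.23 would mean each text token receives roughly ten times the attention of a non-text token; the paper's actual normalization may differ.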