Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Explaining multimodal LLMs via intra-modal token interactions

Created by
  • Haebom

Authors

Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao

Improving the Interpretability of Multimodal Large Language Models

Outline

This paper addresses the limited understanding of the internal decision-making mechanisms of multimodal large language models (MLLMs), which have achieved remarkable success across visual-language tasks. Existing interpretability research has focused primarily on cross-modal features and has tended to overlook intra-modal dependencies. To address this, the paper proposes leveraging intra-modal token interactions to improve interpretability. For the visual branch, it introduces *Multi-Scale Explanation Aggregation (MSEA)*, which dynamically adjusts receptive fields by aggregating explanations computed over multi-scale inputs, producing more holistic and spatially consistent visual explanations. For the textual branch, it proposes *Activation Rank Correlation (ARC)* to measure the relevance of contextual tokens, suppressing spurious activations from irrelevant context while preserving semantically consistent ones.
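The gist of both components can be illustrated with a short sketch. The following is a minimal, illustrative Python example, not the authors' implementation: the `explain_fn` callback, the choice of scales, and the use of Spearman correlation as the rank measure are all assumptions made for the example.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.stats import spearmanr

def msea(image, explain_fn, scales=(0.75, 1.0, 1.25)):
    """Multi-Scale Explanation Aggregation (illustrative sketch).

    Rescales the input image to several sizes, computes a saliency map at
    each scale with a user-supplied `explain_fn` (hypothetical), resizes
    every map back to the original resolution, and averages them into a
    single, more spatially consistent explanation.
    """
    h, w = image.shape[:2]
    maps = []
    for s in scales:
        factors = (s, s) + (1,) * (image.ndim - 2)   # leave channel axis unscaled
        scaled = zoom(image, factors, order=1)
        sal = explain_fn(scaled)                      # 2-D saliency map for this scale
        sal = zoom(sal, (h / sal.shape[0], w / sal.shape[1]), order=1)
        maps.append(sal)
    return np.mean(maps, axis=0)

def arc_scores(context_acts, target_act):
    """Activation Rank Correlation (illustrative sketch).

    Scores each contextual token by the Spearman rank correlation between
    its activation vector and the target token's activation vector; tokens
    with low or negative correlation are treated as irrelevant and their
    contribution suppressed.
    """
    scores = []
    for act in context_acts:
        rho, _ = spearmanr(act, target_act)
        scores.append(max(rho, 0.0))                  # clip away irrelevant tokens
    return np.array(scores)
```

In practice, `explain_fn` would wrap whatever per-scale attribution method is applied to the visual branch, and the activation vectors would come from the model's hidden states; both are placeholders here.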

Takeaways, Limitations

Takeaways:
  • Improves the interpretability of MLLMs by leveraging intra-modal interactions.
  • Provides more holistic and spatially consistent visual explanations through MSEA.
  • Suppresses spurious activations in the textual branch via ARC.
  • Delivers more accurate and detailed explanations than existing interpretability methods.
Limitations:
  • Specific limitations are difficult to determine from the paper's content alone (additional information is needed).