Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Created by
  • Haebom

Authors

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

Outline

This paper addresses the problem that multimodal large language models (MLLMs) struggle to distinguish task-relevant from task-irrelevant signals, leading to errors in tasks such as visual question answering (VQA). The authors frame this limitation as the "cross-modal competence problem" and focus on "modality interference," a phenomenon in which noisy information from an irrelevant modality degrades performance on tasks that rely on a single modality, such as image classification or text-only question answering. They design a perturbation-based causal diagnostic experiment to quantitatively measure modality interference, and they propose a novel framework for fine-tuning MLLMs that combines perturbation-based data augmentation, covering both heuristic perturbations and adversarial perturbations generated via projected gradient descent (PGD), with a consistency regularization strategy. The effectiveness of the proposed method is validated through experiments on various benchmark datasets (image-centric, text-centric, and VQA tasks) and multiple model families.
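To make the recipe concrete, below is a minimal PyTorch-style sketch of the two ingredients the summary mentions: an adversarial image perturbation via PGD and a consistency regularizer that ties predictions on perturbed inputs to predictions on clean inputs. It assumes a simplified classifier-style `model(images) -> logits` interface (a real MLLM forward pass also consumes text tokens), and all names and hyperparameters (`pgd_perturb_image`, `eps`, `lam`, etc.) are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn.functional as F

def pgd_perturb_image(model, images, labels, eps=8/255, alpha=2/255, steps=3):
    """Craft an adversarial image perturbation with projected gradient
    descent (PGD). The L-inf budget eps, step size alpha, and step count
    are illustrative defaults, not the paper's configuration."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(images + delta), labels)
        loss.backward()
        with torch.no_grad():
            # Gradient-ascent step, then projection back into the eps-ball.
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (images + delta).detach()

def consistency_loss(logits_clean, logits_perturbed):
    """KL term pulling the prediction on a perturbed input toward the
    prediction on the clean input (one common choice of regularizer)."""
    log_p_clean = F.log_softmax(logits_clean.detach(), dim=-1)
    log_p_pert = F.log_softmax(logits_perturbed, dim=-1)
    return F.kl_div(log_p_pert, log_p_clean, log_target=True,
                    reduction="batchmean")

def training_step(model, images, labels, lam=1.0):
    """One fine-tuning step: task loss on clean and perturbed views plus a
    consistency penalty weighted by a hypothetical coefficient lam.
    Optimizer plumbing and zeroing of model gradients are omitted."""
    adv_images = pgd_perturb_image(model, images, labels)
    logits_clean = model(images)
    logits_adv = model(adv_images)
    task_loss = (F.cross_entropy(logits_clean, labels)
                 + F.cross_entropy(logits_adv, labels))
    return task_loss + lam * consistency_loss(logits_clean, logits_adv)
```

A heuristic perturbation would occupy the same slot as `pgd_perturb_image`, for example replacing the image with random noise or an unrelated image when the question is answerable from text alone.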

Takeaways, Limitations

Takeaways:
  • The paper clearly defines and quantifies the cross-modal competence problem of MLLMs, in particular modality interference.
  • It proposes an effective fine-tuning framework to mitigate modality interference.
  • It experimentally demonstrates that the proposed method improves performance across diverse benchmark datasets and model families.
  • It shows that unimodal reasoning ability and multimodal task performance can be improved simultaneously.
Limitations:
  • The effectiveness of the proposed method may be limited to the specific benchmark datasets and models evaluated.
  • Further experiments on more diverse and complex multimodal tasks are needed.
  • Adversarial training methods such as PGD are computationally expensive.
  • The generality and limits of the heuristic perturbation strategies require further study.