This paper addresses the problem that multimodal large language models (MLLMs) struggle to distinguish task-relevant from task-irrelevant signals, leading to errors in tasks such as visual question answering (VQA). We define this limitation as the "cross-modal competence problem" and focus on "modal interference," a phenomenon in which noisy information from irrelevant modalities degrades performance on tasks that rely on a single modality, such as image classification or text-only question answering. We design a perturbation-based causal diagnosis experiment to quantitatively measure modal interference and propose a novel framework for fine-tuning MLLMs using perturbation-based data augmentation and consistency regularization, including both heuristic perturbations and adversarial perturbations generated via projected gradient descent (PGD). We validate the effectiveness of the proposed method through experiments on image-centric, text-centric, and VQA benchmarks across multiple model families.
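To make the training recipe concrete, the following is a minimal sketch, not the paper's implementation, of how a PGD adversarial perturbation on the image modality can be combined with a consistency-regularization loss; the model interface, hyperparameters (epsilon, alpha, num_steps, lam), and loss weighting are assumptions for illustration only.

```python
# Illustrative sketch (assumed interface): PGD perturbation of the image input
# plus a KL-based consistency term between clean and perturbed predictions.
import torch
import torch.nn.functional as F

def pgd_perturb(model, images, input_ids, labels,
                epsilon=8 / 255, alpha=2 / 255, num_steps=3):
    """Projected gradient descent on the image modality (L-infinity ball).

    `model(images, input_ids)` returning logits is an assumed interface.
    """
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(num_steps):
        logits = model(images + delta, input_ids)
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        # Ascent step on the perturbation, then project back into the epsilon-ball.
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return (images + delta).detach()

def augmented_loss(model, images, adv_images, input_ids, labels, lam=1.0):
    """Task loss on clean inputs plus a consistency penalty on perturbed inputs."""
    clean_logits = model(images, input_ids)
    adv_logits = model(adv_images, input_ids)
    task = F.cross_entropy(clean_logits, labels)
    # KL divergence pulls predictions on perturbed inputs toward the clean ones.
    consistency = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                           F.softmax(clean_logits, dim=-1),
                           reduction="batchmean")
    return task + lam * consistency
```

A heuristic perturbation (e.g., injecting an unrelated image or distractor text) would simply replace `pgd_perturb` in this sketch; the consistency term is what encourages the model to ignore the irrelevant modality.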