This paper addresses the phenomenon in which vision-language models (VLMs) encounter knowledge conflicts between their internal parametric knowledge and external information when performing complex tasks that draw on multiple knowledge sources. Such conflicts can lead to hallucinations and unreliable responses, yet the mechanisms by which they are resolved remain poorly understood. We introduce a dataset of multimodal counterfactual queries that deliberately contradict the model's internal common-sense knowledge, and use it to analyze how VLMs resolve cross-modal conflicts. Using logit inspection, we identify a small set of attention heads that control the conflict, and show that modifying these heads can steer the model toward answers grounded in either its internal knowledge or the visual input. Finally, we show that the attention of these heads accurately localizes the image regions responsible for visual overrides, and does so more precisely than gradient-based attribution.
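To make the logit-inspection step concrete, the sketch below illustrates one common way such an analysis can be set up: projecting each attention head's contribution at the final token position through the unembedding matrix and comparing the logits of the answer supported by parametric knowledge against the answer supported by the image. This is a minimal illustration under assumed shapes and hypothetical token ids, not the paper's released code; the tensors `head_outputs`, `W_O`, and `W_U` stand in for quantities that would be read from a real VLM forward pass.

```python
import torch

# Minimal sketch of per-head logit inspection (a "logit lens" applied to attention heads).
# Assumptions (hypothetical, for illustration only): hidden size 768, 12 heads,
# GPT-2-sized vocabulary, and random tensors standing in for a real VLM's weights
# and activations at the last token position.

hidden_dim, num_heads, vocab_size = 768, 12, 50257
head_dim = hidden_dim // num_heads

head_outputs = torch.randn(num_heads, head_dim)      # z_h: each head's output at the last position
W_O = torch.randn(num_heads, head_dim, hidden_dim)   # per-head slice of the output projection
W_U = torch.randn(hidden_dim, vocab_size)            # unembedding (lm_head) matrix

internal_token_id = 1234  # hypothetical token id for the parametric-knowledge answer
visual_token_id = 5678    # hypothetical token id for the answer supported by the image

scores = []
for h in range(num_heads):
    # Map this head's additive contribution to the residual stream into vocabulary space.
    head_logits = head_outputs[h] @ W_O[h] @ W_U      # shape: (vocab_size,)
    # Positive score: the head pushes toward the visual answer; negative: toward internal knowledge.
    scores.append((head_logits[visual_token_id] - head_logits[internal_token_id]).item())

# Heads with the largest |score| are candidates for intervention,
# e.g. scaling their contribution up or down during generation.
ranked = sorted(range(num_heads), key=lambda h: -abs(scores[h]))
print("heads ranked by conflict score:", ranked)
```

In a real setting, the per-head outputs would be captured with forward hooks on the model's attention modules, and the intervention described in the abstract would amount to rescaling or ablating the top-ranked heads before their contributions are added back into the residual stream.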