This paper proposes a multimodal speech enhancement framework that combines body-conducted microphone signals (BMS) and acoustic microphone signals (AMS). BMS is robust to ambient noise but loses high-frequency information, whereas AMS preserves high-frequency content but is susceptible to noise. The study addresses these complementary shortcomings with two networks: a mapping-based model that enhances BMS and a masking-based model that suppresses noise in AMS. The two models are integrated through a dynamic fusion mechanism that adapts to local noise conditions, leveraging the strengths of each modality. Evaluations on the TAPS dataset mixed with DNS-2023 noise clips, measured with objective speech quality metrics, demonstrate superior performance over single-modal approaches across diverse noise environments.
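The abstract does not specify the form of the dynamic fusion mechanism. As a rough illustrative sketch only (the function name, the sigmoid gating rule, and the use of a local noise estimate are all assumptions, not the paper's method), a noise-adaptive fusion could be a convex combination of the two branch outputs, weighted per frame by estimated local noise:

```python
import numpy as np

def dynamic_fusion(bms_enhanced, ams_enhanced, noise_estimate, alpha=5.0):
    """Hypothetical fusion: blend the two enhanced signals per frame.

    High local noise -> lean on the noise-robust BMS branch;
    low local noise -> trust the high-frequency-rich AMS branch.
    All arrays share the same shape; noise_estimate is assumed in [0, 1].
    """
    # Sigmoid gate in (0, 1): weight given to the AMS branch.
    w_ams = 1.0 / (1.0 + np.exp(alpha * (noise_estimate - 0.5)))
    return w_ams * ams_enhanced + (1.0 - w_ams) * bms_enhanced
```

In practice such weights would be predicted by a learned network rather than a fixed sigmoid; this sketch only conveys the idea of modality weighting that tracks local noise conditions.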