This paper explores multimodal semantic segmentation, with the goal of improving segmentation accuracy in complex scenes. Existing methods rely on feature fusion modules tailored to specific modality pairs, which limits input flexibility and inflates the number of training parameters. To address this, we propose StitchFusion, a simple yet effective modality fusion framework that directly employs large-scale pre-trained models as encoders and feature fusers. This approach enables comprehensive multimodal, multiscale feature fusion and accommodates arbitrary visual modality inputs. StitchFusion achieves modality integration by sharing multimodal visual information during encoding. To enhance this cross-modal information exchange, we introduce a multidirectional adapter module (MultiAdapter) that transfers information between modalities during encoding; by propagating multiscale information across the pre-trained encoders, MultiAdapter integrates multimodal visual information at encoding time. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on four multimodal segmentation datasets while requiring minimal additional parameters. Moreover, experiments combining MultiAdapter with existing feature fusion modules (FFMs) show that the two approaches are complementary.
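To make the encoding-time fusion idea concrete, the following is a minimal sketch, not the authors' implementation, of how a multidirectional adapter could exchange information between two modality-specific encoder streams at a single encoder stage. The class and parameter names (MultiAdapterSketch, bottleneck_dim), the two-modality setting, and the residual bottleneck-projection design are assumptions made for illustration only.

```python
# Illustrative sketch: a bidirectional bottleneck adapter that exchanges
# features between two modality-specific encoder streams at one stage.
# Names and design choices here are placeholders, not the paper's code.
import torch
import torch.nn as nn


class MultiAdapterSketch(nn.Module):
    """Bidirectional bottleneck adapter between two encoder streams."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        # Lightweight down-/up-projection for each transfer direction.
        self.a_to_b = nn.Sequential(
            nn.Linear(dim, bottleneck_dim), nn.GELU(), nn.Linear(bottleneck_dim, dim)
        )
        self.b_to_a = nn.Sequential(
            nn.Linear(dim, bottleneck_dim), nn.GELU(), nn.Linear(bottleneck_dim, dim)
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # Each stream receives a residual message projected from the other,
        # so cross-modal information is shared during encoding.
        new_a = feat_a + self.b_to_a(feat_b)
        new_b = feat_b + self.a_to_b(feat_a)
        return new_a, new_b


if __name__ == "__main__":
    # Token features from two modalities at one encoder stage (B, N, C).
    rgb_tokens = torch.randn(2, 196, 256)
    depth_tokens = torch.randn(2, 196, 256)
    adapter = MultiAdapterSketch(dim=256)
    rgb_tokens, depth_tokens = adapter(rgb_tokens, depth_tokens)
    print(rgb_tokens.shape, depth_tokens.shape)  # both remain (2, 196, 256)
```

In this sketch, applying such an adapter after several encoder stages would propagate multiscale information between the pre-trained encoders while adding only a small number of trainable parameters, which is the property the abstract emphasizes.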