Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation

Created by
  • Haebom

Author

Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Outline

This paper addresses multimodal semantic segmentation, which aims to improve segmentation accuracy in complex scenes. Existing methods rely on feature fusion modules tailored to specific modality combinations, which limits input flexibility and inflates the number of training parameters. To address this, we propose StitchFusion, a simple yet effective modality fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers, enabling comprehensive multimodal and multiscale feature fusion for arbitrary visual modality inputs. StitchFusion achieves modality integration by sharing multimodal visual information during encoding: a multidirectional adapter module (MultiAdapter) transfers information across modalities and propagates multiscale features between the pre-trained encoders. Experimental results show that the proposed model achieves state-of-the-art performance on four multimodal segmentation datasets while requiring only minimal additional parameters, and that combining MultiAdapter with the existing Feature Fusion Module (FFM) reveals their complementary nature.
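
The core idea described above, lightweight adapters passing information between modality-specific pre-trained encoders during encoding, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the class and variable names (BiDirectionalAdapter, feat_a, feat_b), the bottleneck-MLP design, and the tensor shapes are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class BiDirectionalAdapter(nn.Module):
    """Lightweight bottleneck adapter that exchanges information between two
    modality-specific feature streams at one encoder stage.
    Illustrative sketch only; names and dimensions are assumptions, not the
    authors' MultiAdapter implementation."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # One small down-project / non-linearity / up-project MLP per
        # direction of information flow.
        self.a_to_b = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        self.b_to_a = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) features from two (frozen)
        # pre-trained encoders, e.g. an RGB stream and a depth/thermal stream.
        msg_to_a = self.b_to_a(feat_b)   # message from stream B into stream A
        msg_to_b = self.a_to_b(feat_a)   # message from stream A into stream B
        # Residual exchange keeps the original encoder features intact.
        return feat_a + msg_to_a, feat_b + msg_to_b


if __name__ == "__main__":
    adapter = BiDirectionalAdapter(dim=256)
    rgb = torch.randn(2, 196, 256)   # hypothetical RGB encoder tokens
    aux = torch.randn(2, 196, 256)   # hypothetical auxiliary-modality tokens
    rgb, aux = adapter(rgb, aux)
    print(rgb.shape, aux.shape)      # torch.Size([2, 196, 256]) each
```

In the paper's setting, an adapter of this kind would be applied at multiple encoder stages so that multiscale information flows between the pre-trained encoders while they remain largely unchanged, which is what keeps the number of additional trainable parameters small.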

Takeaways, Limitations

Takeaways:
Proposes StitchFusion, a simple and effective multimodal semantic segmentation framework.
Increases input flexibility and reduces training parameters by directly leveraging pre-trained models.
Achieves effective cross-modal information transfer and multi-scale information integration through MultiAdapter.
Achieves state-of-the-art performance on four multimodal segmentation datasets.
Verifies complementarity with existing feature fusion modules (FFM).
Ensures reproducibility through publicly released code.
Limitations:
Performance may be biased toward the specific datasets evaluated; verification on additional datasets is required.
The design and parameter tuning of MultiAdapter are not explained in detail; the specific design process and optimization strategy need to be described.
Further analysis of performance and efficiency in real-world applications is needed.