Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Created by
  • Haebom

Author

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

Outline

This paper proposes MLLMSeg, a novel approach to the Referring Expression Segmentation (RES) problem. Existing RES methods suffer from a trade-off between performance and computational cost. MLLMSeg achieves efficient performance without an additional visual encoder by leveraging the fine-grained visual features already present in the vision encoder of a multimodal large language model (MLLM). Specifically, a detail-enhanced and semantic-consistent feature fusion (DSFF) module fully integrates these detailed visual features with the semantic features output by the MLLM's large language model (LLM), and a lightweight mask decoder with only 34M parameters performs the final mask prediction. Experimental results show that MLLMSeg outperforms both SAM-based and non-SAM-based competing methods, striking a good balance between performance and cost.
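As a rough illustration of the architecture described above, below is a minimal PyTorch sketch of the idea: fuse fine-grained patch features from the MLLM's vision encoder with a semantic embedding from the LLM, then predict a mask with a small decoder head. This is not the authors' released code; the module names (DSFFBlock, LightMaskDecoder), dimensions, and the 24x24 patch grid are illustrative assumptions.

```python
# Conceptual sketch only (assumed shapes, not the paper's implementation):
# fuse vision-encoder detail features with an LLM semantic embedding,
# then decode a segmentation mask with a lightweight head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSFFBlock(nn.Module):
    """Hypothetical detail-enhanced, semantic-consistent feature fusion."""

    def __init__(self, vis_dim: int, llm_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)   # project detail features
        self.sem_proj = nn.Linear(llm_dim, hidden_dim)   # project semantic features
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, vis_feats: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, N, vis_dim) patch tokens from the vision encoder
        # sem_feat:  (B, llm_dim) semantic embedding from the LLM for the expression
        v = self.vis_proj(vis_feats)                      # (B, N, H)
        s = self.sem_proj(sem_feat).unsqueeze(1)          # (B, 1, H)
        s = s.expand(-1, v.size(1), -1)                   # broadcast to every patch
        return self.fuse(torch.cat([v, s], dim=-1))       # (B, N, H)


class LightMaskDecoder(nn.Module):
    """Hypothetical lightweight decoder: reshape fused tokens and upsample to a mask."""

    def __init__(self, hidden_dim: int = 256, grid: int = 24):
        super().__init__()
        self.grid = grid
        self.head = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim // 2, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim // 2, 1, 1),             # per-pixel mask logit
        )

    def forward(self, fused_tokens: torch.Tensor, out_size=(336, 336)) -> torch.Tensor:
        b, n, h = fused_tokens.shape                      # expects n == grid * grid
        fmap = fused_tokens.transpose(1, 2).reshape(b, h, self.grid, self.grid)
        logits = self.head(fmap)                          # (B, 1, grid, grid)
        return F.interpolate(logits, out_size, mode="bilinear", align_corners=False)


# Example with hypothetical LLaVA-like shapes: 576 patch tokens (24x24),
# 1024-d vision features, 4096-d LLM hidden states.
vis = torch.randn(2, 576, 1024)
sem = torch.randn(2, 4096)
fused = DSFFBlock(1024, 4096)(vis, sem)
mask_logits = LightMaskDecoder()(fused)                   # (2, 1, 336, 336)
```

The point of the sketch is only the division of labor: the new capacity sits in a small fusion-and-decoder head, while the heavy visual and language features are reused from the existing MLLM rather than from a second vision backbone.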

Takeaways, Limitations

Takeaways:
Effectively exploits the fine-grained features of the MLLM's vision encoder, achieving strong performance without an additional visual encoder.
Lightweight architecture (a 34M-parameter mask decoder) keeps computational cost low.
Outperforms both SAM-based and non-SAM-based competing methods.
Effectively fuses visual detail and semantic information through the DSFF module.
Limitations:
The performance gain may depend on the specific MLLM architecture used.
Further validation is needed of generalization to diverse referring expressions and complex images.
Although 34M parameters is relatively small, further study is needed on applicability in resource-constrained settings such as embedded systems.