Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Created by
  • Haebom

Author

Jingchao Wang, Zhijian Wu, Dingjiang Huang, Yefeng Zheng, Hong Wang

Outline

This paper proposes MLLMSeg, a novel approach to the Referring Expression Segmentation (RES) problem. Existing RES methods trade off performance against cost: they either rely on the heavily parameterized Segment Anything Model (SAM) or adopt lightweight SAM-free pipelines that sacrifice accuracy. MLLMSeg instead leverages the visual detail features already present in the vision encoder of the Multimodal Large Language Model (MLLM), so no additional vision encoder is required. Accurate mask prediction is achieved through a detail-enhanced and semantic-consistent feature fusion (DSFF) module that combines detail and semantic information, together with a lightweight mask decoder of only 34M parameters. Experimental results show that MLLMSeg outperforms both SAM-based and SAM-free methods, striking a good balance between performance and cost.
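To make the architecture described above more concrete, below is a minimal PyTorch sketch of how a DSFF-style fusion block and a lightweight mask decoder could be wired together. The layer choices, feature dimensions, upsampling schedule, and the use of a single semantic embedding from the language model are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of DSFF-style fusion plus a small mask decoder.
# All shapes and layer choices are assumptions for illustration.
import torch
import torch.nn as nn


class DSFFBlock(nn.Module):
    """Fuse detail features from the MLLM vision encoder with the
    semantic embedding the language model produces for the referring expression."""

    def __init__(self, detail_dim: int, semantic_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.detail_proj = nn.Conv2d(detail_dim, hidden_dim, kernel_size=1)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden_dim * 2, hidden_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )

    def forward(self, detail_feat: torch.Tensor, semantic_emb: torch.Tensor) -> torch.Tensor:
        # detail_feat: (B, C_v, H, W) spatial features from the vision encoder
        # semantic_emb: (B, C_t) expression embedding from the language model
        d = self.detail_proj(detail_feat)
        s = self.semantic_proj(semantic_emb)[:, :, None, None].expand_as(d)
        return self.fuse(torch.cat([d, s], dim=1))


class LightMaskDecoder(nn.Module):
    """Small convolutional decoder that upsamples fused features to mask logits."""

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, hidden_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_dim // 2, hidden_dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(hidden_dim // 4, 1, kernel_size=1),  # single-channel mask logits
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.up(fused)


if __name__ == "__main__":
    # Toy shapes: a 24x24 vision-encoder grid with 1024-d detail features
    # and a 4096-d language-model embedding (both assumed values).
    detail = torch.randn(1, 1024, 24, 24)
    semantic = torch.randn(1, 4096)
    dsff = DSFFBlock(detail_dim=1024, semantic_dim=4096)
    decoder = LightMaskDecoder()
    mask_logits = decoder(dsff(detail, semantic))
    print(mask_logits.shape)  # torch.Size([1, 1, 96, 96])
```

The point of the sketch is the cost profile: the fusion block and decoder are a few tens of millions of parameters at most, in line with the 34M mask decoder the summary mentions, rather than a full SAM-sized segmentation backbone.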

Takeaways, Limitations

Takeaways:
  • Demonstrates that effectively leveraging the visual detail features inherent in the MLLM's vision encoder yields strong performance without an additional vision encoder.
  • Improves accuracy by fusing detail and semantic information through the DSFF module.
  • Maintains high performance while reducing computational cost with a lightweight mask decoder.
  • Outperforms both SAM-based and SAM-free methods.
Limitations:
  • The performance of MLLMSeg may depend on the underlying MLLM.
  • Because the method is optimized for a specific MLLM, performance may degrade when applied to other MLLMs.
  • Generalization to complex backgrounds and ambiguous referring expressions requires further study.