Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Created by
  • Haebom

Authors

Zhang Li, Biao Yang, Qiang Liu, Shuo Zhang, Zhiyin Ma, Liang Yin, Linger Deng, Yabo Sun, Yuliang Liu, Xiang Bai

Outline

This paper proposes LIRA, a framework that improves both the segmentation and the understanding capabilities of large multi-modal models (LMMs). Although LMMs perform well on segmentation and understanding tasks, they suffer from two limitations: inaccurate segmentation and hallucinated understanding. LIRA addresses both by exploiting the complementary relationship between visual understanding and segmentation. Its first component, the Semantic-Enhanced Feature Extractor (SEFE), fuses semantic and pixel-level features to improve object attribute inference and enable more accurate segmentation. Its second component, Interleaved Local Visual Coupling (ILVC), extracts local features based on segmentation masks and then autoregressively generates local descriptions, providing fine-grained supervision that mitigates hallucination. To quantify the correlation between object segmentation accuracy and the latent meaning associated with tokens, the authors introduce the Attributes Evaluation (AttrEval) dataset. Experimental results show that LIRA achieves state-of-the-art performance on both segmentation and understanding tasks.
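The summary above does not say how ILVC extracts local features from a segmentation mask. As a rough illustration only, the sketch below uses masked average pooling, a common generic technique that is swapped in here and is not confirmed to be the authors' method. The function name `masked_average_pool`, the H x W x C nested-list layout, and the toy values are all hypothetical.

```python
from typing import List

Vector = List[float]


def masked_average_pool(features: List[List[Vector]], mask: List[List[int]]) -> Vector:
    """Average the feature vectors at positions where the mask is 1.

    features: H x W grid of C-dimensional vectors (hypothetical layout).
    mask:     H x W binary segmentation mask (1 = inside the region).
    This is a generic stand-in, not LIRA's actual ILVC implementation.
    """
    if not features or not features[0]:
        raise ValueError("features must be a non-empty H x W grid")
    if len(features) != len(mask) or any(
        len(feat_row) != len(mask_row) for feat_row, mask_row in zip(features, mask)
    ):
        raise ValueError("features and mask must share the same H x W shape")

    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                for c in range(channels):
                    total[c] += vec[c]
                count += 1

    if count == 0:
        # Empty mask: return a zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


# Toy 2 x 2 feature map with 2 channels; the mask selects the left column.
features = [
    [[1.0, 2.0], [10.0, 20.0]],
    [[3.0, 4.0], [30.0, 40.0]],
]
mask = [
    [1, 0],
    [1, 0],
]
local_feature = masked_average_pool(features, mask)
print(local_feature)  # [2.0, 3.0]
```

In a pipeline like the one described, a pooled vector of this kind would be the region-level input from which a local description is generated. How LIRA actually couples the two steps is not specified in this summary.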

Takeaways, Limitations

Takeaways:
A novel approach to two problems of LMMs, inaccurate segmentation and hallucinated understanding, is presented.
SEFE and ILVC improve segmentation accuracy and comprehension ability.
The correlation between object segmentation accuracy and the latent meaning associated with tokens is investigated, and the AttrEval dataset is presented.
State-of-the-art performance is achieved on a variety of segmentation and comprehension tasks.
Limitations:
Further validation of the scale and generalization performance of the presented AttrEval dataset is needed.
LIRA's performance improvements may be limited to specific datasets or tasks.
Analysis of the computational cost and complexity of the LIRA framework is needed.