Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; when sharing, simply cite the source.

Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Created by
  • Haebom

Authors

Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

Outline

This paper examines visual grounding in large multimodal models (LMMs). The authors show that LMMs trained without any explicit grounding supervision can nonetheless exhibit grounding ability, and they introduce an "attend-and-segment" method that converts the model's internal attention maps into segmentation masks to reveal it. To further strengthen grounding, they propose DiffLMM, an LMM built on a diffusion-based visual encoder. Because it does not depend on grounding-specific supervision, DiffLMM is more generalizable and scalable, and it performs competitively on both grounding benchmarks and general visual question-answering benchmarks. Notably, despite being trained without any grounding-related data, it achieves a grounding mask recall of 44.2, outperforming the extensively supervised GLaMM.
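To make the attend-and-segment idea concrete, here is a minimal PyTorch sketch of the general mechanism: take one generated token's attention over the visual patch tokens, reshape it into a 2D map, and binarize it into a mask. The function name, the min-max normalization, and the fixed threshold are illustrative assumptions; the paper's actual procedure (e.g., how attention is aggregated across heads and layers, or whether masks are refined afterwards) may differ.

```python
import torch
import torch.nn.functional as F

def attention_to_mask(attn_weights, grid_hw, image_hw, thresh=0.5):
    """Turn one text token's attention over visual tokens into a binary mask.

    attn_weights: (num_visual_tokens,) attention from a generated token to the
                  visual patch tokens, already averaged over heads/layers.
    grid_hw:      (h, w) layout of the visual tokens, with h * w == num_visual_tokens.
    image_hw:     (H, W) target image resolution.
    """
    h, w = grid_hw
    amap = attn_weights.reshape(1, 1, h, w)
    # Min-max normalize so a fixed threshold is meaningful (illustrative choice).
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-6)
    # Upsample the coarse patch-level map to full image resolution.
    amap = F.interpolate(amap, size=image_hw, mode="bilinear", align_corners=False)
    return amap[0, 0] > thresh  # (H, W) boolean mask

# Example: a 24x24 patch grid mapped onto a 336x336 image.
mask = attention_to_mask(torch.rand(24 * 24), (24, 24), (336, 336))
```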

Takeaways and Limitations

Takeaways:
Grounding ability can emerge even in LMMs trained without any explicit grounding supervision.
The "attend-and-segment" method provides a simple way to surface and evaluate this emergent grounding from the model's attention maps.
DiffLMM's diffusion-based visual encoder further improves grounding ability and generalizability (a hedged sketch of extracting such diffusion features appears at the end of this section).
DiffLMM achieves competitive performance on both grounding and general visual question-answering benchmarks.
Despite being trained without grounding-related data, it outperforms GLaMM.
Limitations:
The paper does not explicitly discuss its own limitations; further research is needed to characterize DiffLMM's performance gains and the limits of its grounding capability.
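For readers who want a concrete picture of a diffusion-based visual encoder like the one behind DiffLMM, below is a hedged sketch of one common recipe using Hugging Face diffusers: encode the image with a Stable Diffusion VAE, run the UNet once, and read intermediate activations out with a forward hook as visual tokens. The checkpoint ID, the choice of mid_block, the timestep, and the zero text embedding are all assumptions for illustration, not the paper's exact design.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

# Checkpoint ID is a placeholder: any Stable Diffusion v1.5-style checkpoint
# with "vae" and "unet" subfolders works. The paper's encoder may differ.
MODEL_ID = "stable-diffusion-v1-5/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")

features = {}
# Forward hook on a mid-level UNet block caches its output activations.
unet.mid_block.register_forward_hook(
    lambda module, inputs, output: features.update(mid=output)
)

@torch.no_grad()
def diffusion_visual_tokens(image, t=100):
    """image: (1, 3, H, W) float tensor scaled to [-1, 1]."""
    latents = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor
    # Some feature-extraction recipes add noise at timestep t first;
    # running the UNet on clean latents is a simplification here.
    cond = torch.zeros(1, 77, unet.config.cross_attention_dim)  # "empty" text
    unet(latents, torch.tensor([t]), encoder_hidden_states=cond)
    feat = features["mid"]                  # (1, C, h, w) feature map
    return feat.flatten(2).transpose(1, 2)  # (1, h*w, C) visual token sequence
```

Unlike a CLIP-style encoder trained on image-text contrast, these features come from a model trained for pixel-level denoising, which is the intuition for why they may carry the finer-grained spatial information that grounding benefits from.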