Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

True Multimodal In-Context Learning Needs Attention to the Visual Context

Created by
  • Haebom

Author

Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, Jindong Gu

Outline

This paper focuses on improving the multimodal in-context learning (MICL) capabilities of multimodal large language models (MLLMs). We note that existing MLLMs struggle to leverage visual information and rely excessively on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. To address this, we propose Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that rebalances attention between visual and textual tokens to direct the model's attention to the visual context. Furthermore, we propose TrueMICL, a MICL-specific dataset whose tasks explicitly require integrating multimodal information, particularly visual content, for accurate completion. Experimental results demonstrate that the proposed method significantly improves true multimodal in-context learning capability.
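The sketch below illustrates the general idea of rebalancing attention toward visual tokens. It is a minimal, hypothetical example, not the authors' implementation: the module name, shapes, and the use of learnable per-head scaling factors on pre-softmax attention logits are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AttentionReallocation(nn.Module):
    """Hypothetical sketch of attention rebalancing in the spirit of DARA.

    Learnable per-head scaling factors boost (or damp) the attention logits
    assigned to visual tokens, nudging the model to attend more to the visual
    context while leaving attention to text tokens unchanged.
    """

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable scale per attention head, initialized to 1 (no change).
        self.visual_scale = nn.Parameter(torch.ones(num_heads))

    def forward(self, attn_logits: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # attn_logits: (batch, heads, query_len, key_len) pre-softmax scores
        # visual_mask: (batch, key_len) boolean, True where the key token is visual
        scale = self.visual_scale.view(1, -1, 1, 1)                  # broadcast over batch/query/key
        mask = visual_mask[:, None, None, :].to(attn_logits.dtype)  # (batch, 1, 1, key_len)
        # Scale only the logits attending to visual tokens; text logits pass through.
        return attn_logits * (1.0 + (scale - 1.0) * mask)


# Toy usage: 2 heads, 3 queries, 4 keys, of which the last 2 keys are visual tokens.
dara = AttentionReallocation(num_heads=2)
logits = torch.randn(1, 2, 3, 4)
visual_mask = torch.tensor([[False, False, True, True]])
rebalanced = dara(logits, visual_mask)
print(rebalanced.shape)  # torch.Size([1, 2, 3, 4])
```

Because only the small set of scaling parameters would be trained, such a scheme keeps fine-tuning lightweight compared to updating the full model.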

Takeaways, Limitations

Takeaways:
We present DARA, an effective fine-tuning strategy for improving the MICL capability of MLLMs.
Release of TrueMICL, a MICL-specific dataset that explicitly requires visual information integration.
TrueMICL overcomes the limitations of existing MICL evaluations and enables assessment of true multimodal in-context learning ability.
Experiments demonstrate that combining DARA with TrueMICL improves the multimodal in-context learning performance of MLLMs.
Limitations:
The effectiveness of DARA and TrueMICL may be limited to specific datasets and models; generalization to other datasets and models needs to be verified.
The TrueMICL dataset may not be large enough and needs to be expanded to include more types of visual information and tasks.
Further research is needed to determine whether the proposed method is applicable to all types of MLLMs.