Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Created by
  • Haebom

Author

Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang

Outline

This paper proposes M$^2$IV, a novel representation engineering technique for making multimodal in-context learning (ICL) in large vision-language models (LVLMs) more efficient. To address the token-intensive nature of conventional ICL and the difficulty of cross-modal few-shot reasoning, M$^2$IV injects learnable multimodal in-context vectors directly into the residual stream of an LVLM in place of explicit token-level demonstrations. By analyzing the distinct roles of multi-head attention (MHA) and multilayer perceptrons (MLPs), the authors design a training strategy that enables fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV improves performance across diverse tasks and LVLMs while significantly reducing token overhead and scaling better to many-shot scenarios. The authors also introduce VLibrary, a repository for storing, retrieving, and reusing trained M$^2$IVs, to improve usability. Experiments show that M$^2$IV outperforms both vanilla ICL and prior representation engineering techniques, with an average accuracy gain of 3.74% alongside efficiency improvements.
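To make the core mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of the idea of injecting learnable in-context vectors into a transformer's residual stream via forward hooks. The toy block, the `M2IVInjector` class, the per-layer gating scalars, and the initialization are illustrative assumptions, not the authors' exact formulation or training objective.

```python
# Hypothetical sketch: replace token-level ICL demonstrations with learnable
# per-layer vectors added to the residual stream. All names (ToyBlock,
# M2IVInjector, alphas) are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """A minimal pre-norm transformer block standing in for one LVLM layer."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                      # MHA contribution to the residual stream
        x = x + self.mlp(self.ln2(x))  # MLP contribution to the residual stream
        return x

class M2IVInjector(nn.Module):
    """Learnable in-context vectors, one per layer, added to each layer's output."""
    def __init__(self, n_layers, d):
        super().__init__()
        self.vectors = nn.Parameter(torch.zeros(n_layers, d))  # trained offline
        self.alphas = nn.Parameter(torch.zeros(n_layers))      # per-layer gate

    def hook(self, layer_idx):
        # Returning a tensor from a forward hook replaces the module's output.
        def fn(module, inputs, output):
            return output + self.alphas[layer_idx] * self.vectors[layer_idx]
        return fn

d, n_layers = 64, 3
blocks = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))
injector = M2IVInjector(n_layers, d)
handles = [b.register_forward_hook(injector.hook(i)) for i, b in enumerate(blocks)]

# The query alone is processed -- no demonstration tokens are prepended,
# which is where the token-overhead saving comes from.
x = torch.randn(2, 5, d)  # (batch, seq, dim)
for b in blocks:
    x = b(x)
print(x.shape)
```

In this sketch only `injector` holds trainable task-specific state, so the backbone stays frozen and a library such as VLibrary would only need to store the small `(n_layers, d)` vector bank per task.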

Takeaways, Limitations

Takeaways:
A novel representation engineering technique, M$^2$IV, substantially improves the efficiency of multimodal in-context learning.
Reduced token overhead improves scalability to many-shot scenarios.
Performance improvements across diverse tasks and LVLMs (average accuracy gain of 3.74%).
VLibrary, a system for storing and retrieving trained M$^2$IVs, improves ease of use.
Limitations:
The reported gains come from experiments on specific datasets and LVLMs; generalization to other settings needs further study.
Consideration needs to be given to the scalability and maintainability of VLibrary.
Further research is needed to optimize M$^2$IV training strategies.