This paper proposes M$^2$IV, a novel representation engineering technique for improving the efficiency of multimodal in-context learning (ICL) in large vision-language models (LVLMs). To address the token-intensive nature of conventional ICL and the complexity of cross-modal few-shot reasoning, M$^2$IV injects learnable multimodal in-context vectors directly into the residual stream of LVLMs in place of explicit token-level demonstrations. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLPs), we design a training strategy that enables fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV improves performance across diverse tasks and LVLMs while significantly reducing token overhead and scaling well to multi-shot scenarios. Furthermore, we enhance usability by introducing VLibrary, a repository for storing, retrieving, and applying trained M$^2$IVs. Experimental results show that M$^2$IV outperforms both standard ICL and prior representation engineering techniques, achieving an average accuracy gain of 3.74% while also improving efficiency.
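As a rough illustration of the residual-stream injection idea described above (a minimal sketch, not the authors' released implementation), the code below attaches one learnable, scaled vector per decoder layer of a frozen model via PyTorch forward hooks. The attribute name `model.layers`, the `hidden_dim` parameter, and the per-layer scalar gates are assumptions made for the example.

```python
import torch
import torch.nn as nn


class InContextVectors(nn.Module):
    """Hypothetical per-layer learnable vectors added to a frozen model's residual stream."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # One learnable vector and one scalar gate per transformer layer (assumed design).
        self.vectors = nn.Parameter(torch.zeros(num_layers, hidden_dim))
        self.alphas = nn.Parameter(torch.ones(num_layers))

    def hook(self, layer_idx: int):
        def _add_vector(module, inputs, output):
            # Decoder layers often return tuples; only the hidden states are shifted.
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + self.alphas[layer_idx] * self.vectors[layer_idx]
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        return _add_vector


def register_vectors(model, icv: InContextVectors):
    """Attach the learned vectors to every decoder layer (assumes `model.layers` exists)."""
    handles = []
    for i, layer in enumerate(model.layers):
        handles.append(layer.register_forward_hook(icv.hook(i)))
    return handles  # call handle.remove() on each to restore vanilla inference
```

In such a setup, only the vectors and gates would be trained (e.g., by distilling from runs that include explicit demonstrations), so inference no longer needs to carry the demonstration tokens themselves.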