Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Towards General Continuous Memory for Vision-Language Models

Created by
  • Haebom

Authors

Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Outline

In this paper, we propose an external memory system that efficiently supplies multimodal and multilingual real-world knowledge, addressing a limitation of existing language models (LMs) and vision-language models (VLMs): they struggle with complex multimodal reasoning tasks. Whereas prior approaches concatenate image and text tokens into long sequences, we represent multimodal and multilingual knowledge with continuous memory, a compact set of dense embeddings, which is both more effective and more efficient. The key insight is that the VLM itself can serve as a continuous memory encoder. Building on this, we present a data- and parameter-efficient method that fine-tunes the VLM as a memory encoder using only 1.2% of its parameters and 15.6K self-synthesized samples. The resulting method, CoMEM, encodes arbitrary multimodal and multilingual knowledge into just eight continuous embeddings; because the VLM stays frozen during inference, the memory can be integrated flexibly in a plug-and-play manner. We demonstrate the effectiveness of the approach through extensive experiments on eight multimodal reasoning benchmarks.
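To make the core mechanism concrete, below is a minimal sketch (not the authors' code) of one way to produce a fixed number of continuous memory embeddings: eight learned query vectors cross-attend over the hidden states a frozen VLM produces for a piece of multimodal knowledge, yielding eight dense embeddings that can then be prepended to the VLM's input at inference. All module and parameter names here are illustrative assumptions, and the compression module is a generic cross-attention stand-in rather than the paper's actual encoder design.

```python
# Minimal sketch, assuming a generic cross-attention compressor over
# frozen-VLM hidden states. Names and dimensions are hypothetical.
import torch
import torch.nn as nn

class ContinuousMemoryCompressor(nn.Module):
    """Cross-attends k learned queries over the frozen VLM's hidden states
    to produce k dense continuous memory embeddings (k = 8 in CoMEM)."""
    def __init__(self, hidden_dim: int = 4096, num_memory_tokens: int = 8,
                 num_heads: int = 8):
        super().__init__()
        # k learned query vectors, one per memory slot.
        self.queries = nn.Parameter(torch.randn(num_memory_tokens, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, knowledge_hidden_states: torch.Tensor) -> torch.Tensor:
        # knowledge_hidden_states: (batch, seq_len, hidden_dim), e.g. the
        # VLM's last hidden states over an image+text knowledge document.
        batch = knowledge_hidden_states.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        memory, _ = self.cross_attn(q, knowledge_hidden_states, knowledge_hidden_states)
        return memory  # (batch, num_memory_tokens, hidden_dim)

# Usage: the 8 memory embeddings would be prepended to the frozen VLM's
# input embeddings at inference time, plug-and-play, with no weight updates.
compressor = ContinuousMemoryCompressor()
fake_hidden = torch.randn(2, 512, 4096)  # stand-in for encoder outputs
mem = compressor(fake_hidden)
print(mem.shape)  # torch.Size([2, 8, 4096])
```

The design choice this illustrates is the compression ratio: an arbitrarily long multimodal sequence is reduced to a constant-size memory (eight vectors), which is what allows the memory to be attached to a frozen VLM without inflating its context.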

Takeaways, Limitations

Takeaways:
Improved performance on complex multimodal reasoning tasks via a continuous memory system that uses the VLM itself as the encoder.
A data- and parameter-efficient fine-tuning method (1.2% of model parameters, 15.6K self-synthesized samples).
Flexible, plug-and-play integration with a frozen VLM at inference time.
Demonstrated effectiveness across eight multimodal reasoning benchmarks.
Limitations:
The fine-tuning method relies on self-synthesized data, so its generalization performance requires further validation.
Whether a continuous memory of only eight embeddings suffices for all kinds of complex reasoning tasks needs further study.
The method may depend on a specific VLM architecture.