In this paper, we propose an external memory system that efficiently provides multimodal and multilingual real-world knowledge to language models (LMs) and vision-language models (VLMs), addressing their difficulty with complex reasoning tasks. Whereas existing approaches concatenate image and text tokens into long sequences, we represent multimodal and multilingual knowledge with a continuous memory, a compact set of dense embeddings, which is both more effective and more efficient. Our key insight is that the VLM itself can serve as the continuous memory encoder. This design improves performance on complex multimodal reasoning tasks, and we present a data- and parameter-efficient method to fine-tune the VLM into a memory encoder using only 1.2% of the model's parameters and 15.6K self-synthesized samples. The resulting method, CoMEM, encodes arbitrary multimodal and multilingual knowledge into just eight continuous embeddings; because the backbone VLM remains frozen during inference, the memory can be integrated in a flexible, plug-and-play manner. Extensive experiments on eight multimodal reasoning benchmarks demonstrate the effectiveness of our approach.
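To make the mechanism concrete, the sketch below illustrates the general idea in PyTorch: compress the hidden states of retrieved knowledge into eight continuous embeddings and feed them, together with the query, to a frozen VLM. In the paper the VLM itself, parameter-efficiently fine-tuned, performs this compression; the small cross-attention pooling module, the class and function names, and the toy stand-in for the VLM used here are illustrative assumptions rather than the actual CoMEM implementation.

```python
# Minimal sketch of a continuous-memory pipeline, assuming a generic frozen VLM
# that maps input embeddings to hidden states. All module and function names
# below are hypothetical; CoMEM fine-tunes the VLM itself as the encoder.
import torch
import torch.nn as nn


class ContinuousMemoryEncoder(nn.Module):
    """Compress a long sequence of knowledge-token hidden states into a fixed
    number of continuous memory embeddings (eight in the paper)."""

    def __init__(self, hidden_dim: int, num_memory_tokens: int = 8):
        super().__init__()
        # Learnable query vectors, one per memory slot.
        self.memory_queries = nn.Parameter(
            torch.randn(num_memory_tokens, hidden_dim) * 0.02
        )
        # Lightweight cross-attention pooling; the trainable part stays small,
        # echoing the parameter-efficient fine-tuning described above.
        self.pool = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, knowledge_states: torch.Tensor) -> torch.Tensor:
        # knowledge_states: (batch, seq_len, hidden_dim) hidden states over the
        # retrieved multimodal/multilingual knowledge.
        batch = knowledge_states.size(0)
        queries = self.memory_queries.unsqueeze(0).expand(batch, -1, -1)
        memory, _ = self.pool(queries, knowledge_states, knowledge_states)
        return memory  # (batch, 8, hidden_dim)


def answer_with_memory(vlm, memory: torch.Tensor, question_embeds: torch.Tensor):
    """Plug-and-play use at inference: prepend the eight memory embeddings to
    the question's input embeddings and run the frozen VLM unchanged."""
    inputs = torch.cat([memory, question_embeds], dim=1)
    with torch.no_grad():  # the backbone VLM stays fixed
        return vlm(inputs)


if __name__ == "__main__":
    hidden_dim = 64
    vlm = nn.Identity()  # toy stand-in for a frozen VLM's forward pass
    encoder = ContinuousMemoryEncoder(hidden_dim)
    knowledge = torch.randn(2, 300, hidden_dim)  # long retrieved-knowledge sequence
    question = torch.randn(2, 20, hidden_dim)    # query embeddings
    memory = encoder(knowledge)                  # -> (2, 8, hidden_dim)
    out = answer_with_memory(vlm, memory, question)
    print(memory.shape, out.shape)
```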