This paper proposes a novel framework for continual learning in scenarios involving multiple modalities (image, video, audio, depth, and text). To overcome the limitations of existing single-modality continual learning methods, we adopt an approach that trains models to align each modality with text. To mitigate the forgetting of previously acquired knowledge caused by differences between modalities, we present a framework that consolidates knowledge within each modality and integrates relevant information across modalities. The framework self-regulates changes in learned representations so that new knowledge is incorporated gradually, and it selectively integrates previously learned knowledge from other modalities according to their interrelationships, thereby mitigating interference between modalities. Furthermore, we introduce a strategy that realigns modality embeddings to correct biased alignment across modalities. We evaluate the proposed method on a wide range of continual learning scenarios spanning multiple datasets and modalities, and experimentally demonstrate that it outperforms existing methods regardless of whether the modality identity is specified.
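
To make the text-anchored alignment idea concrete, the following is a minimal sketch, not the paper's implementation, assuming a CLIP-style symmetric contrastive objective that pulls paired modality and text embeddings together in a shared space; the function name, the `temperature` value, and the batch structure are illustrative assumptions.

```python
# Minimal sketch of text-anchored modality alignment (illustrative only; not the
# paper's method). Assumes a CLIP-style symmetric InfoNCE objective in which each
# non-text modality is aligned to a shared text embedding space.
import torch
import torch.nn.functional as F


def text_anchored_alignment_loss(
    modality_emb: torch.Tensor,  # (B, D) embeddings from a modality encoder (image/audio/...)
    text_emb: torch.Tensor,      # (B, D) embeddings from a (typically frozen) text encoder
    temperature: float = 0.07,   # assumed hyperparameter, as in CLIP-style training
) -> torch.Tensor:
    """Symmetric contrastive loss pulling paired (modality, text) embeddings together."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = modality_emb @ text_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: modality-to-text and text-to-modality.
    loss_m2t = F.cross_entropy(logits, targets)
    loss_t2m = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_m2t + loss_t2m)


# Example usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    modality_features = torch.randn(batch, dim)
    text_features = torch.randn(batch, dim)
    loss = text_anchored_alignment_loss(modality_features, text_features)
    print(f"alignment loss: {loss.item():.4f}")
```

In such a setup, continually training new modality encoders against a common text anchor is what makes the cross-modal knowledge integration and realignment steps described above possible.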