This paper provides a comprehensive overview of multimodal generative AI. Focusing on two dominant techniques, multimodal large language models (LLMs) and diffusion models, we review the probabilistic modeling procedure of each, their multimodal architecture designs, and advanced applications such as image/video LLMs and text-to-image/video generation. We also examine recent research trends toward unified models for both understanding and generation, investigating key design choices including autoregressive versus diffusion-based probabilistic modeling and dense versus mixture-of-experts (MoE) architectures, and analyzing several strategies for building unified models. Finally, we summarize widely used pretraining datasets for multimodal generative AI and suggest future research directions.