This paper presents an integrated review of multimodal generative models for understanding and replicating the real world, a central task in Artificial General Intelligence (AGI) research. Traditional approaches, such as world models, aim to capture the fundamental principles governing the physical world, yet they tend to treat different modalities (2D images, videos, 3D, and 4D representations) as independent domains, overlooking their interdependencies. In contrast, we trace the progression of data dimensions in real-world simulation: beginning with 2D generation (appearance), advancing to video generation (appearance + dynamics) and 3D generation (appearance + geometry), and culminating in 4D generation, which integrates all of these dimensions. By surveying datasets, evaluation metrics, and future directions, we aim to guide future research and provide an accessible entry point for newcomers to the field.