Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Simulating the Real World: A Unified Survey of Multimodal Generative Models

Created by
  • Haebom

Authors

Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong

Outline

This paper presents a unified survey of multimodal generative models aimed at understanding and replicating the real world, a task pursued in Artificial General Intelligence (AGI) research. Traditional approaches such as world models focus on capturing the fundamental principles that govern the physical world, but they tend to treat the different modalities (2D images, videos, 3D, and 4D representations) as independent domains and overlook their interdependencies. The survey instead follows the progression of data dimensions in real-world simulation: it starts with 2D generation (appearance), moves on to video (appearance + dynamics) and 3D generation (appearance + geometry), and ends with 4D generation, which integrates all dimensions. It also reviews datasets and evaluation metrics and discusses future directions, offering guidance for future research and insights for new researchers.

Takeaways, Limitations

Takeaways:
Makes the first attempt to systematically integrate 2D, video, 3D, and 4D generation within a single framework.
Provides an integrated framework for advancing research on multimodal generative models and real-world simulation.
Provides a comprehensive review of datasets, evaluation metrics, and future research directions.
Offers new insights for AGI research.
Limitations:
This line of research is still at an early stage, and further work is needed to establish the performance and practical applicability of 4D generative models.
A more in-depth analysis of the interactions and dependencies between different modalities is needed.
Further validation of the generality and scalability of the proposed framework is needed.