Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Yume: An Interactive World Generation Model

Created by
  • Haebom

Author

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang

Outline

Yume is a project that aims to generate interactive, realistic, and dynamic worlds from images, text, or videos, which users can explore and control using peripheral devices or neural signals. In this paper, we present a prototype of Yume that generates a dynamic world from an input image and enables world exploration via keyboard control. For high-quality, interactive video world generation, we introduce a well-designed framework consisting of four major components: camera motion quantization, a video generation architecture, an advanced sampler, and model acceleration. The main technical features are: (1) camera motion quantization for stable training and user-friendly keyboard input; (2) a Masked Video Diffusion Transformer (MVDT) with a memory module for infinite, autoregressive video generation; (3) a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) for better visual quality and more precise control; and (4) model acceleration through the synergistic optimization of adversarial distillation and caching mechanisms. We trained Yume on Sekai, a high-quality world exploration dataset, and achieved remarkable results across a variety of scenarios and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME, and Yume will be updated monthly.
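To make the camera motion quantization idea more concrete, below is a minimal sketch, not the authors' implementation: all names (`MotionToken`, `KEY_TO_TOKEN`, `CameraDelta`, `quantize_motion`) are hypothetical. It illustrates how continuous camera trajectories could be snapped to a small discrete motion vocabulary that doubles as the keyboard control interface.

```python
# Minimal sketch of camera motion quantization (hypothetical, for illustration only).
# Continuous per-frame camera changes are mapped to a small discrete vocabulary so that
# the same tokens can come from recorded trajectories during training and from
# keyboard presses at inference time.

from dataclasses import dataclass
from enum import Enum
import math


class MotionToken(Enum):
    """Discrete camera-motion vocabulary (hypothetical)."""
    FORWARD = 0
    BACKWARD = 1
    LEFT = 2
    RIGHT = 3
    TURN_LEFT = 4
    TURN_RIGHT = 5
    STAY = 6


# Keyboard controls map directly onto the quantized motions.
KEY_TO_TOKEN = {
    "w": MotionToken.FORWARD,
    "s": MotionToken.BACKWARD,
    "a": MotionToken.LEFT,
    "d": MotionToken.RIGHT,
    "q": MotionToken.TURN_LEFT,
    "e": MotionToken.TURN_RIGHT,
}


@dataclass
class CameraDelta:
    """Continuous per-frame camera change: translation (dx, dz) and yaw (radians)."""
    dx: float
    dz: float
    dyaw: float


def quantize_motion(delta: CameraDelta, move_eps: float = 0.05, yaw_eps: float = 0.02) -> MotionToken:
    """Snap a continuous camera delta to the nearest discrete motion token.

    During training this collapses noisy real-world trajectories into a stable,
    low-cardinality conditioning signal; at inference the same tokens come
    straight from keyboard input.
    """
    translation = math.hypot(delta.dx, delta.dz)
    if abs(delta.dyaw) > yaw_eps and abs(delta.dyaw) > translation:
        return MotionToken.TURN_LEFT if delta.dyaw > 0 else MotionToken.TURN_RIGHT
    if translation < move_eps:
        return MotionToken.STAY
    if abs(delta.dz) >= abs(delta.dx):
        return MotionToken.FORWARD if delta.dz > 0 else MotionToken.BACKWARD
    return MotionToken.RIGHT if delta.dx > 0 else MotionToken.LEFT


if __name__ == "__main__":
    # A noisy recorded trajectory collapses to a clean token...
    print(quantize_motion(CameraDelta(0.01, 0.42, 0.0)))  # MotionToken.FORWARD
    # ...and a key press maps to the same vocabulary at inference time.
    print(KEY_TO_TOKEN["d"])                              # MotionToken.RIGHT
```

In the paper's framework, such tokens would then condition the video generator; keeping the vocabulary small and noise-free is what the authors cite as enabling stable training and user-friendly keyboard control.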

Takeaways, Limitations

Takeaways:
  • Presents technology for creating interactive and realistic virtual worlds from images, text, and videos
  • Intuitive world exploration via keyboard input
  • High-quality video generation and precise control through novel techniques such as MVDT, AAM, and TTS-SDE
  • Efficient optimization techniques (adversarial distillation and caching) applied for model acceleration
  • Open-source release of data, code, and model weights contributes to further research and development
Limitations:
  • The current version relies only on keyboard input; control via peripherals or neural signals is not yet implemented.
  • This is a beta version that requires further development before full functionality is available.
  • Lack of a detailed description of the Sekai dataset.
  • Lack of validation for performance degradation or stability issues that may arise during long-term use.