Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Yan: Foundational Interactive Video Generation

Created by
  • Haebom

Authors

Deheng Ye, Fangyun Zhou, Jiacheng Lv, Jianqi Ma, Jun Zhang, Junyan Lv, Junyou Li, Minwen Deng, Mingyu Yang, Qiang Fu, Wei Yang, Wenkai Lv, Yangbin Yu, Yewen Wang, Yonghang Guan, Zhihao Hu, Zhongbin Fang, Zhongqian Sun

Outline

Yan is a foundational framework that covers the entire interactive video generation pipeline, spanning simulation, generation, and editing, and it consists of three core modules. For AAA-level simulation, we design a highly compressed, low-latency 3D-VAE together with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. For multimodal generation, we inject game-specific knowledge into an open-domain multimodal video diffusion model (VDM) and introduce a hierarchical autoregressive captioning method that turns the VDM into a frame-by-frame, action-controllable, real-time, endless interactive video generator. Even when text and visual prompts come from different domains, the model generalizes well, allowing users to flexibly mix and compose cross-domain styles and mechanics through prompts. For multi-granularity editing, we propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling text-driven, multi-granularity editing of video content during interaction. By integrating these modules, Yan moves interactive video generation beyond isolated functions toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment.
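
To make the control flow concrete, below is a minimal, hypothetical Python sketch of a shift-window generation loop of the kind described above: a fixed-size cache of recent frame latents conditions the denoising of the next frame, the new latent is decoded into a frame, and the window then slides forward. All function names and parameters (read_user_action, denoise_next_latent, decode_latent, WINDOW) are illustrative assumptions, not the paper's actual implementation or API.

```python
from collections import deque

# Hypothetical sketch of a KV-cache-based shift-window generation loop.
# All names below are illustrative stand-ins, not the paper's API.

WINDOW = 8          # number of past frame latents kept in the cache (assumed)
NUM_FRAMES = 600    # frames to generate, e.g. 10 s at 60 FPS

def read_user_action():
    """Poll the player's input for the current frame (stub)."""
    return {"move": "forward", "camera": (0.0, 0.0)}

def denoise_next_latent(kv_cache, action):
    """Predict the next frame's latent, conditioned on the cached context
    and the current action (stub standing in for the diffusion denoiser)."""
    return {"latent": len(kv_cache), "action": action}

def decode_latent(latent):
    """Decode a latent into an output frame (stub for the VAE decoder)."""
    return f"frame({latent['latent']})"

def generate_stream():
    # The cache holds context for only the last WINDOW frames, so each
    # denoising step touches a bounded amount of state; this is what keeps
    # per-frame latency fixed no matter how long the interaction runs.
    kv_cache = deque(maxlen=WINDOW)
    for _ in range(NUM_FRAMES):
        action = read_user_action()
        latent = denoise_next_latent(kv_cache, action)
        yield decode_latent(latent)
        kv_cache.append(latent)   # shift the window: the oldest entry falls off

if __name__ == "__main__":
    for frame in generate_stream():
        pass  # in a real system, present `frame` to the player here
```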

Takeaways, Limitations

Takeaways:
Demonstrates AAA-quality interactive video simulation in real time at 1080P/60FPS.
Enables multimodal video generation and cross-domain style mixing by leveraging game-specific knowledge.
Provides text-driven, multi-granularity editing of video content.
Introduces a new paradigm for interactive video creation and points toward next-generation creative tools.
Limitations:
The paper does not explicitly discuss its limitations or future research directions.
Limited detail on performance evaluation (quantitative metrics and results are not presented).
Little information about the model's training data and resource consumption.