This page organizes papers related to artificial intelligence published around the world. This page is summarized using Google Gemini and is operated on a non-profit basis. The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Created by
Haebom
Authors
Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang
Outline
To address the high computational cost of existing large-scale video generation models, this paper proposes Autoregressive Adversarial Post-Training (AAPT), which transforms a pre-trained latent video diffusion model into a real-time, interactive video generator. The model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). It can stream each result to the user in real time and receive the user's interactive responses as controls for generating the next latent frame. Unlike existing approaches, the method uses adversarial training for autoregressive generation. This permits an architecture that is more efficient for one-step generation while making full use of the KV cache, and it allows the model to be trained in a student-forcing manner, which reduces error accumulation during long video generation. Experimental results show that the 8B model achieves real-time streaming video generation at 24 fps at 736x416 resolution on a single H100, or at 1280x720 resolution on 8xH100 for up to 1 minute (1440 frames).
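The generation loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: `ToyGenerator`, `KVCache`, `stream`, and the scalar "latent frame" are hypothetical stand-ins, chosen only to show one forward pass per latent frame, a cache that grows by one entry per frame, and a user control that is read before each step.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

VIDEO_FPS = 24  # frame rate stated in the summary


@dataclass
class KVCache:
    """Hypothetical stand-in for cached attention state of past frames."""
    entries: List[float] = field(default_factory=list)


class ToyGenerator:
    """Hypothetical one-step generator that counts its forward passes."""

    def __init__(self) -> None:
        self.nfe = 0  # number of function evaluations so far

    def __call__(self, prev_frame: float, control: float, cache: KVCache) -> float:
        self.nfe += 1  # exactly one evaluation per generated latent frame
        # Summarize the cached history instead of recomputing past frames.
        context = sum(cache.entries) / len(cache.entries) if cache.entries else 0.0
        frame = 0.5 * prev_frame + 0.25 * context + control
        cache.entries.append(frame)  # the cache grows by one entry per frame
        return frame


def stream(
    generator: ToyGenerator,
    first_frame: float,
    read_control: Callable[[int], float],
    num_frames: int,
) -> Iterator[float]:
    """Yield latent frames one at a time, reading a control before each step."""
    cache = KVCache()
    frame = first_frame
    for step in range(num_frames):
        control = read_control(step)  # interactive input for the next frame
        frame = generator(frame, control, cache)
        yield frame  # available to the viewer before the next step starts


generator = ToyGenerator()
frames = list(stream(generator, 0.0, lambda step: float(step % 2), num_frames=8))

# One minute of video at 24 fps corresponds to the 1440 frames in the summary.
frames_in_one_minute = 60 * VIDEO_FPS
```

Because `stream` is a generator function, each frame is handed to the caller before the next control is read, which is the property that makes interaction possible in this toy setup.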
Takeaways, Limitations
• Takeaways:
  ◦ Enables real-time, interactive video generation.
  ◦ Generates each latent frame with a single neural function evaluation (1NFE).
  ◦ Uses adversarial training to obtain an architecture that is efficient for one-step generation.
  ◦ Reduces error accumulation through student-forcing training.
  ◦ Supports high-resolution and long-duration video generation.
• Limitations:
  ◦ Although the paper does not directly mention limitations, accessibility in general environments may be limited given the model size (8B) and hardware requirements (H100).
  ◦ Additional information is needed about detailed model performance and generated results (e.g., quality of generated video, performance in various scenarios).
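The difference between student forcing and the more common teacher forcing can be made concrete with a toy rollout. This is a sketch under stated assumptions, not the paper's training code: `rollout` and `biased_model` are hypothetical, frames are plain floats, and no loss or discriminator is included. It only shows which inputs the model sees in each mode.

```python
from typing import Callable, List, Tuple


def rollout(
    model: Callable[[float], float],
    ground_truth: List[float],
    student_forcing: bool,
) -> Tuple[List[float], List[float]]:
    """Return the inputs the model saw and the outputs it produced."""
    inputs: List[float] = []
    outputs: List[float] = []
    prev = ground_truth[0]
    for t in range(1, len(ground_truth)):
        inputs.append(prev)
        pred = model(prev)
        outputs.append(pred)
        # Student forcing feeds the model's own prediction back in.
        # Teacher forcing feeds the ground-truth frame instead.
        prev = pred if student_forcing else ground_truth[t]
    return inputs, outputs


def biased_model(x: float) -> float:
    """Hypothetical model: the true step is +1.0, the model adds an extra 0.1."""
    return x + 1.0 + 0.1


ground_truth = [0.0, 1.0, 2.0, 3.0, 4.0]
teacher_inputs, teacher_outputs = rollout(biased_model, ground_truth, student_forcing=False)
student_inputs, student_outputs = rollout(biased_model, ground_truth, student_forcing=True)

# Per-step error against the ground truth in each mode.
teacher_errors = [abs(o - g) for o, g in zip(teacher_outputs, ground_truth[1:])]
student_errors = [abs(o - g) for o, g in zip(student_outputs, ground_truth[1:])]
```

In this toy case the teacher-forced error stays near 0.1 at every step, while the student-forced error grows with each step because the model consumes its own drifted outputs. Training on such self-generated rollouts exposes the model to the same inputs it will see at inference time, which is the intuition behind the reduced error accumulation reported in the summary.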