Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

StreamDiT: Real-Time Streaming Text-to-Video Generation

Created by
  • Haebom

Authors

Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao

Outline

Text-to-video generation has advanced rapidly with large-scale transformer-based diffusion models, but existing models generate only short videos, which limits their use in real-time, interactive applications. To address this, the paper proposes StreamDiT, a real-time streaming video generation model. StreamDiT is trained with flow matching extended by a moving buffer, and uses mixed training over several partitioning schemes of the buffered frames to improve content consistency and visual quality. The architecture is an adaLN DiT with varying time embeddings and window attention, and the authors train a StreamDiT model with 4 billion parameters. They also propose a multistep distillation method tailored to StreamDiT: sampling distillation is performed within each segment of the chosen partitioning scheme, which reduces the number of function evaluations (NFEs) enough to reach real-time performance (16 FPS at 512p resolution). The model is evaluated with quantitative metrics and human evaluation, and the authors point to real-time applications such as streaming generation, interactive generation, and video-to-video conversion.
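The general idea of a moving buffer can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the function name stream_chunks, the integer "steps remaining" noise counter, and the uniform one-step-per-call schedule are all hypothetical, and no real model is called. The buffer holds chunks at staggered noise levels; each simulated model call advances every chunk by one step, the front chunk becomes clean and is emitted, and a fresh pure-noise chunk is appended at the back.

```python
from collections import deque


def stream_chunks(num_chunks, num_outputs):
    """Toy moving-buffer loop (hypothetical; not the StreamDiT code).

    Each buffer entry is [chunk_id, steps_remaining]. The front chunk is
    almost clean (1 step left); the back chunk is pure noise (num_chunks
    steps left).
    """
    buffer = deque([i, i + 1] for i in range(num_chunks))
    next_id = num_chunks
    emitted = []
    model_calls = 0

    while len(emitted) < num_outputs:
        # One simulated model call denoises every buffered chunk by one step.
        for entry in buffer:
            entry[1] -= 1
        model_calls += 1

        # The front chunk is now clean: emit it and slide the buffer forward.
        chunk_id, steps_left = buffer.popleft()
        assert steps_left == 0
        emitted.append(chunk_id)

        # Refill the back of the buffer with a new pure-noise chunk.
        buffer.append([next_id, num_chunks])
        next_id += 1

    return emitted, model_calls


if __name__ == "__main__":
    chunks, calls = stream_chunks(num_chunks=4, num_outputs=6)
    print(chunks, calls)
```

In this toy setup, once the buffer is full each simulated call emits exactly one clean chunk, which is the kind of steady-state cost a streaming generator needs in order to keep up with playback.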

Takeaways, Limitations

Takeaways:
Proposes StreamDiT, a model that enables real-time streaming video generation.
Runs a 4-billion-parameter model in real time at 16 FPS.
Improves content consistency and visual quality through mixed training and multistep distillation.
Opens up a range of real-time applications, including streaming generation, interactive generation, and video-to-video conversion.
Limitations:
Output is currently limited to 512p resolution; further research is needed to support higher resolutions.
Further research is needed to determine how well the proposed distillation method generalizes.
The model's computational cost and memory consumption are not analyzed in detail.
Further research is needed on robustness to diverse text inputs.