Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; when sharing, simply cite the source.

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Created by
  • Haebom

Author

Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie

Outline

SANA-Video is a small diffusion model that efficiently generates videos up to minutes long at resolutions up to 720x1280, with strong text-video alignment, and it can be deployed on a single RTX 5090 GPU. Its efficiency rests on two core designs: Linear DiT, which uses linear attention as its core operation, and a block-wise autoregressive approach with a constant-memory KV cache, so memory cost stays fixed no matter how long the generated video becomes. Combined with data filtering and a model-training strategy, training takes only 12 days on 64 H100 GPUs, roughly 1% of the cost of MovieGen. The model is competitive with state-of-the-art small diffusion models while running 16x faster, and deploying on the RTX 5090 with NVFP4 precision further speeds up 5-second 720p video generation by 2.4x.
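The fixed-memory property of linear attention can be made concrete with a small sketch. This is not the paper's implementation; it is a minimal, generic linear-attention recurrence (the feature map and variable names here are assumptions) showing why the running state replaces a KV cache that would otherwise grow with every generated token:

```python
import numpy as np

def linear_attention_step(state, norm, q, k, v):
    """One autoregressive step of linear attention.

    Instead of storing all past keys/values (a KV cache that grows with
    sequence length), linear attention keeps a constant-size running state:
      state: (d, d) cumulative sum of outer products k ⊗ v
      norm:  (d,)   cumulative sum of feature-mapped keys
    so memory cost is fixed regardless of how many tokens were generated.
    """
    # Simple non-negative feature map (an assumption; papers vary here).
    phi = lambda x: np.maximum(x, 0.0) + 1e-6
    qf, kf = phi(q), phi(k)
    state = state + np.outer(kf, v)       # accumulate key-value association
    norm = norm + kf                      # accumulate keys for normalization
    out = (qf @ state) / (qf @ norm)      # attention output for this token
    return out, state, norm

# Usage: generate 5 tokens; the state never grows.
d = 8
state, norm = np.zeros((d, d)), np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = rng.standard_normal((3, d))
    out, state, norm = linear_attention_step(state, norm, q, k, v)
print(out.shape, state.shape)  # (8,) (8, 8) — state size is independent of step count
```

In a block-wise autoregressive video model, the same idea applies per block: each new block attends to the accumulated state of all previous blocks rather than to their full key/value tensors, which is what keeps long-video memory cost constant.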

Takeaways, Limitations

Takeaways:
Generates high-quality, minute-long videos at low cost.
Improved accessibility through deployment on a consumer RTX 5090 GPU.
High generation speed.
Efficiency achieved through linear attention and a fixed-memory KV cache design.
Performance competitive with state-of-the-art small diffusion models.
Limitations:
Lacks specific quantitative metrics (e.g., FID, CLIP Score).
Lacks information about the model's generalization ability.
Lacks information about the model's complexity and parameter count.