This page organizes papers related to artificial intelligence published around the world. This page is summarized using Google Gemini and is operated on a non-profit basis. The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.
SANA-Video is a compact diffusion model that efficiently generates minute-length videos at resolutions up to 720x1280. It synthesizes high-resolution, high-quality, long videos with strong text-video alignment at high speed, and it can be deployed on an RTX 5090 GPU. Two design choices make this possible. First, it is built on Linear DiT, a diffusion transformer whose core operation is linear attention. Second, it generates long videos block by block in an autoregressive fashion, using a KV cache whose memory cost stays fixed regardless of video length. Data filtering and model training strategies reduce the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. SANA-Video is competitive with state-of-the-art compact diffusion models while running 16x faster, and when deployed on an RTX 5090 GPU with NVFP4 precision it speeds up generation of a 5-second 720p video by 2.4x.
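The fixed-memory cache follows from a general property of linear attention: the contribution of all past keys and values can be folded into running sums whose size does not depend on how many tokens have been seen. The sketch below illustrates that idea in plain Python. It is not the authors' implementation. The ReLU feature map, the single attention head, the choice to let each block attend to itself and all earlier blocks, and every name (`LinearAttentionState`, `generate_blockwise`) are assumptions made for illustration.

```python
# Illustrative sketch only: block-wise linear attention with a constant-size state.
# Assumptions (not taken from the summary): ReLU feature map, one head,
# each block attends to itself and to all earlier blocks.

def relu(vec):
    """Feature map phi applied to a query or key vector."""
    return [max(0.0, x) for x in vec]


class LinearAttentionState:
    """Running sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).

    S has shape dim_k x dim_v and z has length dim_k, no matter how many
    tokens have been folded in. This plays the role of the KV cache.
    """

    def __init__(self, dim_k, dim_v):
        self.S = [[0.0] * dim_v for _ in range(dim_k)]
        self.z = [0.0] * dim_k

    def update(self, keys, values):
        """Fold one block of keys and values into the running sums."""
        for k, v in zip(keys, values):
            fk = relu(k)
            for a, fa in enumerate(fk):
                self.z[a] += fa
                for b, vb in enumerate(v):
                    self.S[a][b] += fa * vb

    def read(self, queries, eps=1e-6):
        """Attend from each query to everything folded into the state so far."""
        dim_v = len(self.S[0])
        outputs = []
        for q in queries:
            fq = relu(q)
            denom = sum(fa * za for fa, za in zip(fq, self.z)) + eps
            numer = [sum(fq[a] * self.S[a][b] for a in range(len(fq)))
                     for b in range(dim_v)]
            outputs.append([n / denom for n in numer])
        return outputs


def generate_blockwise(blocks, dim_k, dim_v):
    """Process (queries, keys, values) blocks in order, carrying only the state."""
    state = LinearAttentionState(dim_k, dim_v)
    outputs = []
    for queries, keys, values in blocks:
        state.update(keys, values)   # current block becomes visible
        outputs.append(state.read(queries))
    return outputs, state
```

However many blocks are processed, the state stays at `dim_k * dim_v + dim_k` numbers. A conventional softmax-attention KV cache, by contrast, stores every past key and value and so grows with video length.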
Takeaways, Limitations
• Takeaways:
  ◦ Generates high-quality, long videos at low cost.
  ◦ Improves accessibility by enabling deployment on RTX 5090 GPUs.
  ◦ Generates videos at high speed.
  ◦ Achieves its efficiency through linear attention and a fixed-memory KV cache design.
  ◦ Delivers performance competitive with the latest compact diffusion models.
• Limitations:
  ◦ Lack of information on specific performance metrics (e.g., FID, CLIP Score).
  ◦ Lack of information about the model's generalization ability.
  ◦ Lack of information about the model's complexity and number of parameters.
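The cost and speed figures quoted in the summary can be unpacked with simple arithmetic. The sketch below uses only the numbers stated above. The MovieGen figure it derives is merely what the "1%" ratio implies if both costs are counted in comparable GPU-days, which the summary does not confirm. The 60-second baseline in the usage line is a hypothetical value, not a measurement.

```python
# Figures quoted in the summary.
TRAINING_DAYS = 12
NUM_GPUS = 64            # H100 GPUs
MOVIEGEN_COST_RATIO = 100  # "only 1% of the cost of MovieGen" -> MovieGen costs 100x
NVFP4_SPEEDUP = 2.4      # 5-second 720p video on an RTX 5090

gpu_days = TRAINING_DAYS * NUM_GPUS    # total training budget in GPU-days
gpu_hours = gpu_days * 24              # the same budget in GPU-hours

# Implied by the quoted ratio only; assumes both costs are in comparable GPU-days.
implied_moviegen_gpu_days = gpu_days * MOVIEGEN_COST_RATIO


def latency_after_speedup(baseline_seconds, speedup=NVFP4_SPEEDUP):
    """Generation time after applying the quoted speedup to a given baseline."""
    return baseline_seconds / speedup


print(gpu_days, gpu_hours, implied_moviegen_gpu_days)
# Hypothetical 60-second baseline, for illustration only.
print(latency_after_speedup(60.0))
```

The training budget works out to 768 GPU-days, or 18,432 GPU-hours.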