Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions. When sharing, please cite the source.

Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Created by
  • Haebom

Authors

Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair

Overview

Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image and text-to-video generation and editing, but video generation is computationally expensive because of the large model size and the quadratic cost of spatial-temporal attention, which is paid at every denoising step. Static caching mitigates this by reusing features at fixed steps, but it cannot adapt to the generation dynamics, resulting in a suboptimal trade-off between speed and quality. In this paper, we propose Foresight, an adaptive layer reuse technique that reduces computational redundancy between denoising steps while maintaining baseline performance. Foresight dynamically decides, for every layer, when the output of a DiT block from an earlier step can be reused, and adapts that decision to generation parameters such as resolution and denoising schedule. Implemented on OpenSora, Latte, and CogVideoX, Foresight achieves end-to-end speedups of up to \latencyimprv while maintaining video quality. The source code for Foresight is available at https://github.com/STAR-Laboratory/foresight .
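
To make the idea of adaptive layer reuse concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple reuse rule that the summary does not specify: a block's cached output is reused when its last two computed outputs differed by less than a threshold, and it is recomputed after a fixed number of consecutive reuses. The names AdaptiveReuseRunner, relative_change, threshold, and max_reuse are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Callable, List, Optional

Vector = List[float]
Block = Callable[[Vector, int], Vector]  # (hidden state, step index) -> new hidden state


def relative_change(new: Vector, old: Vector) -> float:
    """Squared difference between two outputs, normalised by the energy of `old`."""
    num = sum((a - b) ** 2 for a, b in zip(new, old))
    den = sum(b * b for b in old) or 1e-12  # avoid division by zero
    return num / den


class AdaptiveReuseRunner:
    """Toy per-layer output cache with an adaptive reuse decision (assumed rule)."""

    def __init__(self, blocks: List[Block], threshold: float, max_reuse: int) -> None:
        self.blocks = blocks
        self.threshold = threshold  # assumed: how small a change counts as "stable"
        self.max_reuse = max_reuse  # assumed: forced refresh after this many reuses
        n = len(blocks)
        self.cache: List[Optional[Vector]] = [None] * n
        self.stable = [False] * n
        self.reused_in_a_row = [0] * n
        self.computed = 0
        self.reused = 0

    def step(self, x: Vector, t: int) -> Vector:
        """Run one denoising step through all blocks, reusing outputs where allowed."""
        for i, block in enumerate(self.blocks):
            cached = self.cache[i]
            can_reuse = (
                cached is not None
                and self.stable[i]
                and self.reused_in_a_row[i] < self.max_reuse
            )
            if can_reuse:
                # Skip the block entirely and substitute its cached output.
                x = cached
                self.reused_in_a_row[i] += 1
                self.reused += 1
                continue
            out = block(x, t)
            self.computed += 1
            # Decide per layer whether the next steps may reuse this output.
            self.stable[i] = (
                cached is not None
                and relative_change(out, cached) < self.threshold
            )
            self.cache[i] = out
            self.reused_in_a_row[i] = 0
            x = out
        return x
```

A static cache would replace the can_reuse test with a fixed schedule such as "recompute every k-th step" for all layers. The per-layer stability check is what makes this sketch adaptive: layers whose outputs keep changing are always recomputed, while layers that have settled are skipped.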

Takeaways, Limitations

Takeaways: We present Foresight, a novel adaptive layer reuse technique that reduces the computational cost of video generation with diffusion transformers. We demonstrate speedups on several models, including OpenSora, Latte, and CogVideoX. Making the source code publicly available also supports reproducibility and further extension.
Limitations: The speedup figure appears only as an unexpanded LaTeX macro (\latencyimprv), so the actual magnitude of the speedup cannot be assessed from this summary. Further analysis is needed of how well the method generalizes to other video generation models and how much it depends on specific hardware. The effect of Foresight on memory usage is not analyzed.
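
As a rough illustration of why memory usage matters for any scheme that caches block outputs, the sketch below estimates the size of a cache holding one output tensor per layer. It is back-of-the-envelope arithmetic under stated assumptions, not a measurement of Foresight. The function name block_cache_bytes and every number in the example call are hypothetical.

```python
def block_cache_bytes(
    num_layers: int,
    num_tokens: int,
    hidden_dim: int,
    bytes_per_value: int = 2,  # assumed: 16-bit floating point activations
    batch: int = 1,
) -> int:
    """Bytes needed to keep one cached output tensor for every layer."""
    return batch * num_layers * num_tokens * hidden_dim * bytes_per_value


# Hypothetical configuration, chosen only to show the order of magnitude.
size = block_cache_bytes(num_layers=28, num_tokens=30_000, hidden_dim=1152)
print(f"{size / 2**30:.2f} GiB")
```

Because the cache grows linearly with the number of tokens, higher resolutions and longer videos increase it directly, which is why a memory analysis would complement the reported speedups.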