Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Enhancing Long Video Generation Consistency without Tuning

Created by
  • Haebom

Authors

Xingyao Li, Fengzhuo Zhang, Jiachun Pan, Yunlong Hou, Vincent Y. F. Tan, Zhuoran Yang

Overview

This paper addresses consistency in long video generation, especially smoothness and the transitions between scenes. To improve consistency and coherence in videos generated from single or multiple prompts, the authors propose TiARA, a time-frequency based temporal attention reweighting algorithm built on the Discrete Short-Time Fourier Transform (DSFT). TiARA improves inter-frame consistency by editing the attention score matrix according to a frequency-based analysis. For videos generated from multiple prompts, the authors also identify prompt alignment as an important factor and propose PromptBlend, a prompt interpolation pipeline that systematically aligns the prompts. Experiments show consistent and significant improvements over several baseline models.
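
To make the two ingredients named above more concrete, here is a minimal sketch of a discrete short-time Fourier transform over a per-frame signal and a simple reweighting of a temporal attention matrix. It is not the paper's TiARA algorithm, whose exact rule is not given in this summary. The function names, the window and hop sizes, the high-frequency cutoff, and the rule of scaling off-diagonal attention are all assumptions chosen for illustration.

```python
import numpy as np

def discrete_stft(signal, window_size=8, hop=4):
    """Discrete short-time Fourier transform of a 1-D per-frame signal.

    Returns a complex array of shape (num_windows, window_size // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(window_size)
    starts = range(0, len(signal) - window_size + 1, hop)
    # Slice overlapping windows, taper each one, then transform along time.
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    return np.fft.rfft(frames, axis=1)

def high_frequency_ratio(signal, window_size=8, hop=4):
    """Fraction of spectral energy in the upper half of the frequency bins.

    Hypothetical indicator of abrupt frame-to-frame change (an assumption,
    not a quantity defined in the source).
    """
    power = np.abs(discrete_stft(signal, window_size, hop)) ** 2
    cutoff = power.shape[1] // 2
    total = power.sum()
    return 0.0 if total == 0 else float(power[:, cutoff:].sum() / total)

def reweight_temporal_attention(attn, strength):
    """Scale off-diagonal (cross-frame) attention by 1 + strength,
    then renormalise each row so it still sums to 1.

    Assumed illustrative rule, not the reweighting used in the paper.
    """
    attn = np.asarray(attn, dtype=float)
    scale = np.full(attn.shape, 1.0 + strength)
    np.fill_diagonal(scale, 1.0)
    out = attn * scale
    return out / out.sum(axis=1, keepdims=True)

# Usage: a rapidly alternating per-frame signal carries more high-frequency
# energy than a constant one, so it would receive stronger reweighting here.
flicker = np.tile([1.0, -1.0], 8)
strength = high_frequency_ratio(flicker)
attn = np.full((4, 4), 0.25)
smoothed = reweight_temporal_attention(attn, strength)
```

In this sketch, a larger `strength` shifts each frame's attention toward the other frames, which is one plausible way that editing the attention score matrix could encourage inter-frame consistency.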

Takeaways, Limitations

Takeaways:
The authors present this as the first application of a frequency-based method to a video diffusion model for improving the consistency of long video generation.
TiARA and PromptBlend improve the consistency and coherence of video generation with both single and multiple prompts.
The paper shows the importance of prompt alignment in multi-prompt video generation and proposes a method to improve it.
Experiments on several baseline models demonstrate the effectiveness of the proposed methods.
Limitations:
The computational cost and complexity of the proposed method are not analyzed.
Further research is needed on how well the method generalizes across different types of video data.
Further analysis is needed to determine whether performance degrades for specific types of prompts or videos.