Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders

Created by
  • Haebom

Authors

Dohun Lee, Hyeonho Jeong, Jiwook Kim, Duygu Ceylan, Jong Chul Ye

Outline

This paper focuses on improving the feature representation capability of video diffusion models. Unlike prior work on video diffusion models, which has primarily pursued architectural innovation or new learning objectives, this paper improves performance by aligning the feature representations of pre-trained vision encoders with the intermediate features of a video generator. To identify suitable encoders, the authors analyze the discriminability and temporal coherence of various vision encoders, and based on this analysis they propose Align4Gen, a novel multi-feature fusion and alignment method. Align4Gen yields performance improvements in both conditional and unconditional video generation tasks.
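To make the core idea concrete, below is a minimal sketch of what such a feature-alignment term could look like in training code. This is not the authors' released implementation; the class name `FeatureAlignment`, the projection head, and the cosine-similarity loss are illustrative assumptions, and the multi-feature fusion across several encoders described in the paper is omitted for brevity.

```python
# Hypothetical sketch: align intermediate video-DiT features with features
# from a frozen self-supervised vision encoder (e.g., a DINO-style model).
# Names and shapes are assumptions for illustration, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignment(nn.Module):
    """Projects generator features into the encoder's feature space and
    penalizes dissimilarity via negative cosine similarity."""

    def __init__(self, diffusion_dim: int, encoder_dim: int):
        super().__init__()
        # Lightweight projection head from generator features to encoder space.
        self.proj = nn.Sequential(
            nn.Linear(diffusion_dim, encoder_dim),
            nn.SiLU(),
            nn.Linear(encoder_dim, encoder_dim),
        )

    def forward(self, diffusion_feats: torch.Tensor,
                encoder_feats: torch.Tensor) -> torch.Tensor:
        # diffusion_feats: (batch, tokens, diffusion_dim) from an intermediate DiT block
        # encoder_feats:   (batch, tokens, encoder_dim) from the frozen vision encoder
        projected = self.proj(diffusion_feats)
        cos = F.cosine_similarity(projected, encoder_feats.detach(), dim=-1)
        return (1.0 - cos).mean()  # 0 when features are perfectly aligned


if __name__ == "__main__":
    # Stand-in tensors; in practice these come from the DiT and a frozen encoder.
    align = FeatureAlignment(diffusion_dim=1152, encoder_dim=768)
    dit_feats = torch.randn(2, 256, 1152)
    enc_feats = torch.randn(2, 256, 768)
    loss = align(dit_feats, enc_feats)
    print(loss.item())
```

In such a setup the alignment loss would typically be added, with a weighting coefficient, to the usual denoising objective, while the vision encoder stays frozen so that only the generator and the projection head are updated.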

Takeaways, Limitations

Takeaways:
Proposes Align4Gen, a novel method for improving the feature representation capability of video diffusion models.
Provides criteria for selecting suitable encoders by analyzing how well the features of various vision encoders align with video generation.
Demonstrates performance improvements in both conditional and unconditional video generation tasks.
Limitations:
Further research is needed on the generalization performance of the proposed Align4Gen.
Performance is not evaluated or analyzed on a diverse range of video datasets.
The increase in computational cost is not considered.