Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing

Created by
  • Haebom

Author

Abdelilah Aitrouga, Youssef Hmamouche, Amal El Fallah Seghrouchni

Outline

Recent advances in video editing have brought deep learning models that capture spatiotemporal dependencies into the mainstream. However, the quadratic computational complexity of standard attention mechanisms limits how well these models scale to long or high-resolution videos. To address this, the paper proposes VRWKV-Editor, a video editing model that integrates a linear spatiotemporal aggregation module into a video-based diffusion model. VRWKV-Editor leverages the bidirectional weighted key-value (WKV) recurrence mechanism of the RWKV architecture to capture global dependencies while maintaining temporal consistency, achieving complexity linear in sequence length without sacrificing quality.
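To make the linear-complexity claim concrete, here is a minimal sketch of an RWKV-style WKV recurrence run once forward and once backward, in the spirit of the bidirectional mechanism described above. This is an illustration under assumptions, not the paper's implementation: the function names (wkv_recurrence, bidirectional_wkv), the per-channel decay/bonus parameters, and the simple forward/backward averaging are hypothetical simplifications, and production RWKV kernels use a numerically stable log-space formulation rather than raw exponentials.

```python
import torch

def wkv_recurrence(k: torch.Tensor, v: torch.Tensor,
                   w: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """One-directional RWKV-style WKV recurrence, O(T) in sequence length.

    k, v: (T, C) key and value sequences
    w:    (C,) per-channel decay rate (positive; larger = faster forgetting)
    u:    (C,) per-channel bonus weight applied to the current token
    """
    T, C = k.shape
    out = torch.empty_like(v)
    num = torch.zeros(C)  # running weighted sum of past values
    den = torch.zeros(C)  # running sum of past weights
    decay = torch.exp(-w)
    for t in range(T):
        e_cur = torch.exp(u + k[t])                    # bonus weight for token t
        out[t] = (num + e_cur * v[t]) / (den + e_cur + 1e-8)
        e_k = torch.exp(k[t])
        num = decay * num + e_k * v[t]                 # decay the past, add token t
        den = decay * den + e_k
    return out

def bidirectional_wkv(k, v, w, u):
    """Bidirectional aggregation: one forward and one backward pass, still O(T)."""
    fwd = wkv_recurrence(k, v, w, u)
    bwd = wkv_recurrence(k.flip(0), v.flip(0), w, u).flip(0)
    return 0.5 * (fwd + bwd)

# Tiny usage example: 16 tokens, 8 channels, aggregated in linear time.
T, C = 16, 8
y = bidirectional_wkv(torch.randn(T, C), torch.randn(T, C),
                      torch.rand(C), torch.rand(C))
print(y.shape)  # torch.Size([16, 8])
```

Because each pass touches every token exactly once, compute and memory grow linearly with sequence length, which is the property the speedup and memory figures below depend on.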

Takeaways, Limitations

Takeaways:
VRWKV-Editor achieves up to a 3.7x speedup and a 60% reduction in memory usage compared to state-of-the-art diffusion-based video editing methods.
Maintains competitive performance in terms of frame consistency and text alignment.
On longer videos, the editing-speed gap over self-attention-based architectures grows even larger.
Limitations:
The paper does not explicitly discuss its limitations. (There may be aspects in which performance falls short of existing SOTA models, but the paper may not specifically identify them.)