This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, Maneesh Agrawala, Lu Jiang, Gordon Wetzstein
Overview
This paper reframes long video generation as a long-context memory problem. Because the quadratic cost of self-attention makes long sequences impractical, it proposes Mixture of Contexts (MoC), a learnable sparse attention routing module. For each query, MoC dynamically selects a few informative chunks plus mandatory anchors (the caption and local windows), and its routing is causal, which prevents loop closures. As the training data is scaled up and the routing is progressively sparsified, the model allocates computation to the important parts of the history, preserving identities, actions, and scenes across minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), which makes training and synthesis practical and yields memory and consistency on the order of minutes.
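The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: chunks are scored by the dot product between the query and a mean-pooled chunk key, chunk 0 stands in for the caption anchor, the query's own chunk stands in for the local window, and the names `route`, `sparse_attention`, `chunk_bounds`, and `anchor_chunks` are hypothetical. For brevity the query attends to its whole local chunk rather than only to earlier tokens inside it.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one descriptor."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def route(query, keys, chunk_bounds, query_chunk, top_k, anchor_chunks=(0,)):
    """Return the sorted chunk indices one query may attend to.

    chunk_bounds: (start, end) token ranges, one per chunk, in temporal order.
    query_chunk: index of the chunk that contains the query token.
    anchor_chunks: chunks that are always kept (assumed: chunk 0 is the caption).
    """
    # Mandatory anchors: the caption chunk(s) and the query's own local chunk.
    selected = set(anchor_chunks) | {query_chunk}
    # Causal routing: only strictly earlier chunks are candidates.
    candidates = [c for c in range(query_chunk) if c not in selected]
    scores = {
        c: dot(query, mean_pool(keys[chunk_bounds[c][0]:chunk_bounds[c][1]]))
        for c in candidates
    }
    best = sorted(candidates, key=lambda c: scores[c], reverse=True)[:top_k]
    return sorted(selected | set(best))


def sparse_attention(query, keys, values, chunk_bounds, chunks):
    """Softmax attention restricted to the tokens of the routed chunks."""
    idx = [i for c in chunks for i in range(*chunk_bounds[c])]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim)]


# Toy data (hypothetical): 4 chunks of 2 tokens each; chunk 0 plays the caption.
keys = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0],
        [0.0, 1.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
values = keys
chunk_bounds = [(0, 2), (2, 4), (4, 6), (6, 8)]
query = [1.0, 0.0]

chunks = route(query, keys, chunk_bounds, query_chunk=3, top_k=1)
output = sparse_attention(query, keys, values, chunk_bounds, chunks)
print(chunks)
```

In this toy setup the query sits in chunk 3 and keeps chunks 0, 1, and 3: the caption anchor, the best-matching earlier chunk, and its own local chunk. Chunk 2 is dropped, and no later chunk can ever be selected because candidates are restricted to earlier positions.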
Takeaways, Limitations
• Takeaways:
◦ Presents an efficient method for generating long-context videos, with near-linear scaling.
◦ Addresses memory and consistency issues in long video generation.
◦ Mitigates the quadratic computational cost of self-attention through sparse attention routing.
◦ Opens the possibility of generating videos that are minutes long.
• Limitations:
◦ The performance of the MoC module may depend heavily on the data scale and the sparsification strategy.
◦ Further research is needed on how well the proposed method generalizes.
◦ Further analysis is needed on how effectively causal routing prevents loop closures, and on where it falls short.
◦ A detailed analysis of the compute and memory required for practical applications is still needed.
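To make the near-linear scaling claim concrete, the following rough count compares query-key pairs under dense attention and under chunk routing. It is a back-of-the-envelope sketch only: the chunk size, `top_k`, and anchor length are hypothetical values rather than figures from the paper, and the routing term is an upper bound that ignores the causal restriction.

```python
def dense_pairs(n_tokens):
    """Query-key pairs evaluated by full self-attention."""
    return n_tokens * n_tokens


def routed_pairs(n_tokens, chunk_size, top_k, anchor_tokens):
    """Approximate query-key pairs with chunk routing (assumed cost model)."""
    n_chunks = -(-n_tokens // chunk_size)  # ceiling division
    # Each query scores one pooled descriptor per chunk (upper bound).
    routing = n_tokens * n_chunks
    # Each query then attends to top_k routed chunks, its local chunk, and the anchors.
    kept_chunks = min(top_k + 1, n_chunks)
    attended = n_tokens * (kept_chunks * chunk_size + anchor_tokens)
    return routing + attended


for n in (1_000, 10_000):
    print(n, dense_pairs(n), routed_pairs(n, chunk_size=100, top_k=2, anchor_tokens=50))
```

Under these assumed settings, a tenfold increase in tokens raises the dense count a hundredfold but the routed count only 12.5 times. The residual super-linear growth comes from scoring every chunk descriptor, which is why the scaling is near-linear rather than strictly linear.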