This page organizes papers related to artificial intelligence published around the world. This page is summarized using Google Gemini and is operated on a non-profit basis. The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.
DiTraj: training-free trajectory control for video diffusion transformer
Created by
Haebom
Authors
Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao
Outline
This paper proposes DiTraj, a simple and effective training-free framework for trajectory control in text-to-video generation with Diffusion Transformer (DiT) models that use 3D full attention. DiTraj uses an LLM to decouple the user-provided prompt into a foreground prompt and a background prompt, which guide the generation of the foreground and background regions of the video. To further strengthen trajectory control, it introduces inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). STD-RoPE modifies only the position embeddings of foreground tokens, which removes their cross-frame spatial mismatch, and it adjusts the density of the position embeddings to achieve 3D-aware trajectory control. Experimental results show that the proposed method outperforms existing methods in both video quality and trajectory control performance.
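The position-remapping idea can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact STD-RoPE formulation, so the box layout `(top, left, height, width)`, the choice of a reference frame, and the function name `remap_spatial_positions` are all assumptions made for illustration. The sketch gives every foreground token the spatial position that the matching point has inside the reference frame's box, so the same part of the object shares one spatial index in every frame, while background tokens keep their own grid positions. Rescaling by the box size changes the spacing between neighboring foreground positions, which is one plausible reading of "adjusting the density".

```python
from typing import List, Tuple

# Hypothetical box layout: (top, left, height, width) in latent-token units.
Box = Tuple[int, int, int, int]
Grid = List[List[Tuple[float, float]]]


def remap_spatial_positions(
    grid_h: int, grid_w: int, boxes: List[Box], ref_frame: int = 0
) -> List[Grid]:
    """Return one grid of (row, col) spatial position indices per frame.

    Assumes each box lies fully inside the grid_h x grid_w token grid.
    Background tokens keep their own (row, col). Foreground tokens (those
    inside the frame's box) take the position of the matching point in the
    reference frame's box.
    """
    ref_top, ref_left, ref_h, ref_w = boxes[ref_frame]
    frames: List[Grid] = []
    for top, left, h, w in boxes:
        # Start from the unmodified grid positions.
        grid: Grid = [
            [(float(r), float(c)) for c in range(grid_w)] for r in range(grid_h)
        ]
        for r in range(top, top + h):
            for c in range(left, left + w):
                # Offset inside this frame's box, rescaled to the reference
                # box size. A larger box yields a smaller step between
                # neighboring tokens (denser positions), a smaller box a
                # larger step (sparser positions).
                rel_r = (r - top) * ref_h / h
                rel_c = (c - left) * ref_w / w
                grid[r][c] = (ref_top + rel_r, ref_left + rel_c)
        frames.append(grid)
    return frames


# Three frames on a 4 x 6 token grid: the box moves right, then grows.
boxes = [(0, 0, 2, 2), (1, 2, 2, 2), (0, 0, 4, 4)]
positions = remap_spatial_positions(4, 6, boxes)
```

Because rotary embeddings compute each rotation angle as a position multiplied by a frequency, the fractional positions produced here could be passed to a RoPE implementation that accepts non-integer positions. Whether DiTraj handles density this way is not stated in the summary.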
Takeaways, Limitations
• Takeaways:
  ◦ Enables training-free trajectory control in DiT-based video generation models.
  ◦ Proposes a trajectory control framework that uses an LLM to separate the prompt into foreground and background prompts.
  ◦ Improves trajectory control by strengthening cross-frame attention through STD-RoPE.
  ◦ Proposes a position embedding density control technique for 3D-aware trajectory control.
  ◦ Demonstrates better video quality and trajectory control performance than existing methods.
• Limitations:
  ◦ Using an LLM may incur additional computational cost.
  ◦ Further research is needed on how well STD-RoPE generalizes and applies to other DiT-based models.
  ◦ Performance still needs to be evaluated across varied trajectory types and complex scenes.