Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

DiTraj: training-free trajectory control for video diffusion transformer

Created by
  • Haebom

Authors

Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao

Outline

This paper proposes DiTraj, a simple and effective training-free framework for trajectory control in text-to-video generation, targeting video generation models built on a Diffusion Transformer (DiT) with 3D full attention. DiTraj uses an LLM to decouple the user-provided prompt into a foreground prompt and a background prompt, which guide the generation of the foreground and background regions of the video respectively. To strengthen trajectory control, the paper further proposes inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). STD-RoPE modifies only the position embeddings of foreground tokens, which removes their cross-frame spatial mismatch, and it adjusts the density of the position embedding to achieve 3D-aware trajectory control. The reported experiments show that DiTraj outperforms existing methods in both video quality and trajectory controllability.
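The STD-RoPE description above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each latent token carries a (t, y, x) position id, that the foreground in each frame is a rectangular box following the trajectory, and that "removing cross-frame spatial mismatch" means giving foreground tokens the spatial ids of the first frame's box. The function names, the box format, and the choice of frame 0 as reference are all hypothetical. Under these assumptions, a larger box packs its ids more densely into the same coordinate range, which is one possible reading of the density adjustment.

```python
import numpy as np


def default_position_ids(num_frames, height, width):
    """Return (T, H, W, 3) position ids holding (t, y, x) for every latent token."""
    t, y, x = np.meshgrid(
        np.arange(num_frames), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([t, y, x], axis=-1).astype(np.float64)


def decouple_foreground_ids(pos_ids, boxes):
    """Reassign spatial ids of foreground tokens only (illustrative assumption).

    boxes[f] = (top, left, box_h, box_w) is a hypothetical per-frame foreground box.
    Background tokens and all temporal ids are left untouched.
    """
    out = pos_ids.copy()
    top0, left0, h0, w0 = boxes[0]  # frame 0 serves as the reference box
    for f, (top, left, bh, bw) in enumerate(boxes):
        # Relative cell centres inside this frame's box, in (0, 1).
        rel_y = (np.arange(bh) + 0.5) / bh
        rel_x = (np.arange(bw) + 0.5) / bw
        # Map them into the reference box's coordinate range. A box larger than
        # the reference gets ids spaced less than 1 apart (denser), a smaller
        # box gets ids spaced more than 1 apart (sparser).
        ys = top0 + rel_y * h0 - 0.5
        xs = left0 + rel_x * w0 - 0.5
        yy, xx = np.meshgrid(ys, xs, indexing="ij")
        out[f, top:top + bh, left:left + bw, 1] = yy
        out[f, top:top + bh, left:left + bw, 2] = xx
    return out


# Toy example: the box moves between frames 0 and 1, then grows in frame 2.
pos = default_position_ids(num_frames=3, height=6, width=8)
boxes = [(1, 1, 2, 2), (2, 3, 2, 2), (0, 0, 4, 4)]
new_pos = decouple_foreground_ids(pos, boxes)
```

In this toy setup the foreground tokens of frames 0 and 1 end up with identical spatial ids even though the box has moved, while their temporal ids still differ. A real model would then feed these ids into its rotary embedding; that step is omitted here.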

Takeaways, Limitations

Takeaways:
Enables training-free trajectory control in DiT-based video generation models.
Proposes a trajectory control framework that uses an LLM to separate the prompt into foreground and background parts.
Improves trajectory control by strengthening cross-frame attention through STD-RoPE.
Proposes a technique for controlling position embedding density to achieve 3D-aware trajectory control.
Demonstrates improved video quality and trajectory control performance compared to existing methods.
Limitations:
Using an LLM for prompt decoupling may add computational cost.
Further research is needed on how well STD-RoPE generalizes and whether it applies to other DiT-based models.
Performance still needs to be evaluated across varied trajectory types and complex scenes.