Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models

Created by
  • Haebom

Author

Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, Guoliang Kang

Outline

This paper addresses motion customization: generating videos that follow a motion concept specified by a set of reference video clips sharing that motion, using a text-to-video diffusion model (DM). Prior work has explored various ways of representing and injecting motion concepts into large pretrained text-to-video diffusion models (e.g., learning motion LoRAs or using latent noise residuals), but these methods inevitably also encode the appearance of the reference videos, which weakens the model's appearance generation capability. This paper follows the common approach of learning motion LoRAs to encode the motion concept, but proposes two novel strategies to improve motion-appearance separation: Temporal Attention Purification (TAP) and Appearance Highway (AH). TAP assumes that the pretrained value embeddings already provide sufficient building blocks for new motions; the motion LoRAs are therefore used only to reshape the temporal attention, which recombines the pretrained value embeddings to produce the new motion. AH changes the starting point of each skip connection in the U-Net from the output of the temporal attention module to the output of the spatial attention module. Experimental results show that, compared with prior work, the proposed method generates videos whose appearance is more consistent with the text descriptions and whose motion is more faithful to the reference videos.
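To make the two strategies concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. The class and parameter names (`LoRALinear`, `TemporalAttentionTAP`, `TransformerBlockAH`, `use_ah`, `rank`) are hypothetical, and the block treats a single generic token axis for brevity, whereas a real video U-Net reshapes between spatial tokens and frame tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # LoRA starts as a no-op on top of the base layer

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class TemporalAttentionTAP(nn.Module):
    """Temporal self-attention in the spirit of TAP: motion LoRAs reshape only the
    attention map (query/key projections); the value and output projections stay
    pretrained and frozen, so new motions are composed purely by re-weighting
    pretrained value embeddings."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.to_q = LoRALinear(nn.Linear(dim, dim, bias=False), rank)
        self.to_k = LoRALinear(nn.Linear(dim, dim, bias=False), rank)
        self.to_v = nn.Linear(dim, dim, bias=False)    # no LoRA (TAP assumption)
        self.to_out = nn.Linear(dim, dim, bias=False)  # no LoRA
        for p in list(self.to_v.parameters()) + list(self.to_out.parameters()):
            p.requires_grad = False

    def forward(self, x):  # x: (batch, seq, dim); seq = frames in a real video U-Net
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)

class TransformerBlockAH(nn.Module):
    """Block combining spatial and temporal attention. With the appearance-highway
    reroute enabled, the skip/residual passed to later layers starts from the
    spatial-attention output instead of the temporal-attention output, so
    appearance features bypass the motion LoRAs."""
    def __init__(self, dim: int, use_ah: bool = True):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal_attn = TemporalAttentionTAP(dim)
        self.use_ah = use_ah

    def forward(self, x):  # x: (batch, tokens, dim)
        spatial_out = x + self.spatial_attn(x, x, x, need_weights=False)[0]
        temporal_out = spatial_out + self.temporal_attn(spatial_out)
        skip = spatial_out if self.use_ah else temporal_out  # AH reroute
        return temporal_out, skip
```

In this sketch, only the LoRA `down`/`up` weights inside the query and key projections are trainable, which mirrors the paper's intent of letting the reference clips influence *how* frames attend to each other while leaving the pretrained appearance pathway untouched.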

Takeaways, Limitations

Takeaways: The Temporal Attention Purification (TAP) and Appearance Highway (AH) strategies achieve better motion-appearance separation than existing methods, enabling video generation with appearance consistent with the text description and motion consistent with the reference videos. This contributes to the field of motion customization with diffusion models.
Limitations: The effectiveness of TAP and AH may be limited to certain types of diffusion models and datasets; additional experiments across a wider range of diffusion models and datasets are needed. Generalization to videos with highly complex or diverse motions also remains to be evaluated.