Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

Created by
  • Haebom

Authors

Guoxian Song, Hongyi Xu, Xiaochen Zhao, You Xie, Tianpei Gu, Zenan Li, Chenxu Zhang, Linjie Luo

Outline

X-UniMotion is a unified, expressive, implicit latent representation of whole-body human motion that covers facial expressions, body poses, and hand gestures. Unlike existing motion-transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, this work directly encodes multi-scale motion from a single image into four distinct latent tokens: one for facial expression, one for body pose, and one for each hand. These motion latents are highly expressive and identity-agnostic, enabling high-fidelity, fine-grained cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, the authors present a self-supervised, end-to-end framework that jointly learns the motion encoder and its latent representations together with a DiT-based video generation model, trained on a large and diverse human-motion dataset. Motion-identity disentanglement is strengthened through 2D spatial and color augmentations and through synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, motion-token learning is guided by an auxiliary decoder that encourages fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments demonstrate that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.
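To make the four-token structure and the disentanglement trick concrete, below is a minimal PyTorch sketch in the spirit of the description above. Everything here is an assumption for illustration: the class names, backbone, token dimension, and augmentation recipe are not the authors' implementation, and the DiT-based video generator and auxiliary decoder are omitted.

```python
# Hypothetical sketch, not the authors' code: a motion encoder that maps a
# driving frame to four identity-agnostic motion tokens.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Maps a driving frame to four motion tokens: face, body, left hand,
    right hand. The backbone and token dimension are placeholders."""

    def __init__(self, token_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in image encoder
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 128, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per latent token, matching the paper's four-token split.
        self.heads = nn.ModuleDict({
            name: nn.Linear(128, token_dim)
            for name in ("face", "body", "left_hand", "right_hand")
        })

    def forward(self, frame: torch.Tensor) -> dict[str, torch.Tensor]:
        feat = self.backbone(frame)
        return {name: head(feat) for name, head in self.heads.items()}

def augment(frame: torch.Tensor) -> torch.Tensor:
    """Stand-in for the paper's 2D spatial and color augmentations: a random
    per-channel color jitter plus a small horizontal shift."""
    jitter = 0.8 + 0.4 * torch.rand(frame.size(0), 3, 1, 1)
    shift = int(torch.randint(-8, 9, (1,)))
    return torch.roll(frame * jitter, shifts=shift, dims=-1).clamp(0, 1)

frames = torch.rand(2, 3, 256, 256)              # batch of driving frames
tokens = MotionEncoder()(augment(frames))        # appearance-perturbed input
print({name: t.shape for name, t in tokens.items()})
```

The augmentation is what drives the disentanglement: because the encoder only ever sees appearance-perturbed frames, identity cues smuggled into the tokens stop helping reconstruction, which pushes the four tokens toward pure motion.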

Takeaways, Limitations

Takeaways:
A novel implicit latent representation that captures full-body human motion, including facial expressions, body poses, and hand gestures, from a single image.
Enables high-fidelity, fine-grained cross-identity motion transfer.
Outperforms state-of-the-art methods in motion fidelity and identity preservation.
Efficient learning through a self-supervised, end-to-end framework (a sketch of the objective appears after the Limitations list below).
Limitations:
Relies on a large and diverse human-motion dataset.
Partial reliance on synthetic 3D renderings for training data.
Generalization to real-world data needs further validation.
The interpretability of the latent tokens requires further study.
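As a complement to the takeaway about self-supervised training, the sketch below illustrates the shape of the reconstruction objective. The stubs, names, and choice of L1 loss are assumptions for illustration; the paper jointly trains a real motion encoder and a DiT-based video generator.

```python
# Hypothetical sketch of the self-supervised training signal, with stubs
# standing in for the jointly trained motion encoder and video generator.
import torch
import torch.nn.functional as F

def motion_encoder(frame):            # stub: pooled features as "motion tokens"
    return frame.mean(dim=(2, 3))

def generator(identity_img, tokens):  # stub: identity image modulated by tokens
    gate = tokens.sigmoid().mean(dim=1)[:, None, None, None]
    return identity_img * gate

def training_loss(identity_img, driving_frame, target):
    """Tokens from the driving frame plus the identity image must reproduce
    the target. Self-supervised case: identity and driving frames come from
    the same clip and the target is the driving frame itself. Cross-identity
    case: the target is a synthetic 3D render of the identity subject posed
    like the driving subject."""
    tokens = motion_encoder(driving_frame)
    pred = generator(identity_img, tokens)
    return F.l1_loss(pred, target)

identity = torch.rand(2, 3, 64, 64)
driving = torch.rand(2, 3, 64, 64)
print(training_loss(identity, driving, target=driving).item())
```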