Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Created by
  • Haebom

Author

Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, Richard Bowden

Outline

This paper addresses the generation of realistic, high-fidelity 3D facial animations for expressive avatar systems in human-computer interaction and accessibility. To overcome the limitations of existing methods tied to the mesh domain, it proposes VisualSpeaker, a novel method that uses photorealistic differentiable rendering supervised by visual speech recognition. Its core component is a perceptual lip-reading loss, computed by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model. Evaluation on the MEAD dataset shows that VisualSpeaker improves the standard Lip Vertex Error metric by 56.1% and raises the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation. In particular, the perceptual focus supports accurate mouth shapes, which provide essential cues for disambiguating similar manual signs in sign language avatars.
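The idea of the perceptual lip-reading loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: `vasr_features` is a hypothetical placeholder standing in for the frozen pre-trained Visual ASR encoder, and the real loss is applied to differentiable 3D Gaussian Splatting renders inside a training loop.

```python
import numpy as np

def vasr_features(frames):
    # Placeholder "feature extractor": mean-pools each frame.
    # In VisualSpeaker this would be a frozen, pre-trained
    # Visual Automatic Speech Recognition encoder.
    return np.stack([f.mean(axis=(0, 1)) for f in frames])

def perceptual_lip_loss(rendered_frames, reference_frames):
    """Distance between VASR features of rendered avatar frames
    and ground-truth video frames (squared-error form assumed)."""
    feat_render = vasr_features(rendered_frames)
    feat_ref = vasr_features(reference_frames)
    return float(np.mean((feat_render - feat_ref) ** 2))

# Toy usage: identical frame stacks yield zero loss.
frames = [np.zeros((4, 4, 3)) for _ in range(5)]
assert perceptual_lip_loss(frames, frames) == 0.0
```

Because the loss is defined in the feature space of a lip-reading model rather than on mesh vertices, it penalizes mouth shapes that would be misread, which is the property the paper highlights for sign language avatars.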

Takeaways, Limitations

Takeaways:
Presents a novel way to effectively leverage advances in 2D computer vision and graphics for 3D facial animation.
Introduces a perceptual lip-reading loss that produces more realistic and natural 3D facial animations than existing methods.
Improves the Lip Vertex Error metric and perceptual quality, increasing usability in applications such as sign language avatars.
Retains the controllability of mesh-based animation.
Limitations:
Evaluation is reported only on the MEAD dataset, so generalization to other datasets is uncertain.
Results may depend on the quality of the pre-trained Visual Automatic Speech Recognition model.
Gaussian Splatting rendering can be computationally expensive.