Daily Arxiv

This page organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and institutions; please cite the source when sharing.

EAI-Avatar: Emotion-Aware Interactive Talking Head Generation

Created by
  • Haebom

Authors

Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

Outline

This paper proposes EAI-Avatar, a novel emotion-aware framework for generating conversational avatars in two-way (interactive) dialogue settings. To overcome the limitations of existing one-way portrait animation methods, it leverages the dialogue generation capabilities of large language models (LLMs, e.g., GPT-4) to produce virtual avatars with rich, temporally consistent emotional variation. Specifically, the authors design a Transformer-based head mask generator that learns temporally consistent motion features in a latent mask space, allowing mask sequences of arbitrary length to be generated to control head movements. Furthermore, they introduce an interactive dialogue tree structure in which each node stores links to its parent, children, and siblings along with the current character's emotional state, thereby representing conversational state transitions. Through reverse level-order traversal, rich past emotional cues are extracted from the current node to guide facial expression synthesis. Extensive experiments demonstrate the superior performance and effectiveness of the proposed method.
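The dialogue-tree idea can be pictured with a minimal sketch. The node layout (parent/children/siblings plus an emotion label) and the traversal below are illustrative assumptions based only on the summary above; the paper's actual data structures and implementation are not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueNode:
    """One conversational turn: links to parent/children plus the speaker's emotional state."""
    emotion: str                                    # e.g. "happy", "neutral" (labels are illustrative)
    parent: Optional["DialogueNode"] = None
    children: List["DialogueNode"] = field(default_factory=list)

    def add_child(self, emotion: str) -> "DialogueNode":
        child = DialogueNode(emotion=emotion, parent=self)
        self.children.append(child)
        return child

    @property
    def siblings(self) -> List["DialogueNode"]:
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


def past_emotion_cues(current: DialogueNode) -> List[str]:
    """Walk from the current node back toward the root, level by level (a reverse
    level-order traversal of the dialogue path), collecting the emotional states of
    each node and its siblings as cues to guide expression synthesis."""
    cues: List[str] = []
    node: Optional[DialogueNode] = current
    while node is not None:
        level = [node] + node.siblings          # nodes visible at this level of the path
        cues.extend(n.emotion for n in level)
        node = node.parent                      # move one level up toward the root
    return cues


# Tiny usage example: root turn -> reply -> current turn
root = DialogueNode(emotion="neutral")
reply = root.add_child("happy")
current = reply.add_child("surprised")
print(past_emotion_cues(current))               # ['surprised', 'happy', 'neutral']
```

In this sketch, the returned emotion sequence would serve as the conditioning signal for the expression generator; how those cues are actually encoded and fused in EAI-Avatar is not specified in the summary.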

Takeaways, Limitations

Takeaways:
Presents a technique for generating emotionally rich, temporally consistent interactive avatars in two-way conversation settings.
Suggests that real-time or near-real-time animation may be achievable with an efficient LLM- and Transformer-based architecture.
Shows that effective use of emotional information enables more realistic and immersive avatars.
Limitations:
Further research is needed on the real-time performance and scalability of the proposed method.
Generalization across diverse emotional expressions and conversational contexts still needs to be assessed and improved.
Obtaining the high-resolution, high-quality data needed for realistic avatars remains difficult.
Because the framework depends heavily on an LLM, the quality of avatar generation may be limited by the LLM's performance.