Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

Created by
  • Haebom

Author

Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, Xiaoqiang Liu, Pengfei Wan

Outline

This paper presents a framework for real-time, interactive digital-human video generation. To address the high computational cost and limited controllability of existing methods, the authors propose an autoregressive video generation approach capable of low-latency inference. With minimal modification to a large language model (LLM), the framework accepts diverse conditioning signals, including audio, pose, and text, and outputs spatially and semantically consistent representations that guide the denoising process of a diffusion model. For training, the authors construct a large-scale conversation dataset of roughly 20,000 hours, and they introduce a deep compressive autoencoder with up to a 64x compression ratio to reduce the long-horizon inference burden of the autoregressive model. Experiments spanning two-way conversation, multilingual human synthesis, and interactive world modeling demonstrate low latency, high efficiency, and fine-grained multimodal controllability.
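To make the "up to 64x compression" claim concrete, below is a minimal toy sketch of what a spatially compressive encoder with an 8x reduction per side (64x fewer spatial positions) could look like. All shapes, the average-pooling stand-in, and the random channel projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical shapes -- illustrative assumptions, not the paper's configuration.
FRAME_H, FRAME_W, CHANNELS = 512, 512, 3
SPATIAL_DOWN = 8          # 8x per side -> 64x fewer spatial positions
LATENT_CHANNELS = 16

def encode(frame: np.ndarray) -> np.ndarray:
    """Toy stand-in for a deep compressive autoencoder's encoder:
    average-pool 8x8 patches, then apply a (random) channel projection."""
    h, w, c = frame.shape
    pooled = frame.reshape(h // SPATIAL_DOWN, SPATIAL_DOWN,
                           w // SPATIAL_DOWN, SPATIAL_DOWN, c).mean(axis=(1, 3))
    proj = np.random.default_rng(0).standard_normal((c, LATENT_CHANNELS))
    return pooled @ proj  # (64, 64, 16) latent grid

frame = np.zeros((FRAME_H, FRAME_W, CHANNELS), dtype=np.float32)
latent = encode(frame)

# The autoregressive model would then predict these small latents frame by
# frame, instead of raw pixels -- this is where the inference savings come from.
spatial_compression = (FRAME_H * FRAME_W) / (latent.shape[0] * latent.shape[1])
print(latent.shape)        # (64, 64, 16)
print(spatial_compression) # 64.0
```

The point of the sketch is the arithmetic: each generated latent carries 64x fewer spatial positions than a raw frame, which is what lets an autoregressive model sustain long rollouts at low latency.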

Takeaways, Limitations

Takeaways:
Demonstrates the feasibility of digital-human video generation that can interact in real time.
Provides fine-grained controllability via multiple modalities (audio, pose, text).
Achieves low-latency, high-efficiency inference through a deep compressive autoencoder.
Reflects real-world conversation scenarios by building a large-scale conversation dataset.
Limitations:
Further evaluation of the generalization performance of the proposed method is needed.
Analysis is needed of the information loss and image-quality degradation that the high compression ratio may introduce.
Lack of detailed description of the composition and quality of the 20,000-hour conversation dataset.
A more detailed comparative analysis with other state-of-the-art methods is needed.