Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VibeVoice Technical Report

Created by
  • Haebom

Authors

Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei

Outline

VibeVoice is a novel model that synthesizes long-form speech from multiple speakers using next-token diffusion, a unified approach that autoregressively generates latent vectors to model continuous data. By introducing a novel continuous speech tokenizer that achieves 80x higher data compression than the popular Encodec model, VibeVoice greatly improves computational efficiency when processing long sequences while preserving audio fidelity. As a result, VibeVoice can synthesize long-form speech (within a 64K-token context window) for up to four speakers, producing a realistic conversational atmosphere that surpasses both open-source and commercial dialogue models.
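As a rough sanity check on how 80x compression relates to the 64K context window, the sketch below works through the arithmetic. The baseline token rate (Encodec at 24 kHz with 8 codebooks, ~600 tokens/s) and the resulting 7.5 frames/s latent rate are assumptions for illustration, not figures stated in this summary:

```python
# Back-of-the-envelope check (assumed numbers, not official ones):
# Encodec at 24 kHz with 8 codebooks emits ~75 frames/s * 8 = 600 tokens/s.
ENCODEC_TOKENS_PER_SEC = 75 * 8          # assumed baseline token rate

# An 80x more compressed tokenizer would then emit ~7.5 latent frames/s.
frames_per_sec = ENCODEC_TOKENS_PER_SEC / 80

context_window = 64_000                  # 64K context from the report
max_minutes = context_window / frames_per_sec / 60
print(f"{frames_per_sec} frames/s -> up to {max_minutes:.0f} minutes of audio")
```

Under these assumptions the window covers roughly 142 minutes of pure audio frames, which comfortably accommodates the reported 90-minute synthesis with headroom left for script text tokens sharing the same context.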

Takeaways, Limitations

Takeaways:
An efficient long-form, multi-speaker speech synthesis model based on next-token diffusion.
A new continuous speech tokenizer with an 80x higher data compression ratio than existing models.
High-quality multi-speaker speech synthesis of up to 90 minutes in length.
A more realistic conversational atmosphere than open-source and commercial models.
Limitations:
The paper does not present specific performance evaluation metrics (e.g., sound quality, naturalness).
The 64K context window may cap the maximum synthesizable duration.
Performance with more than four speakers has not been verified.
Lack of information about the model's training data and specific architecture.