Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech

Created by
  • Haebom

Authors

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu

Outline

IndexTTS2 is a model proposed to overcome a limitation of existing large-scale autoregressive text-to-speech (TTS) models, which produce highly natural speech but make it difficult to control duration. It offers two generation modes: one in which the number of generated tokens is specified explicitly, giving precise control over speech duration, and a free mode in which the token count is left unspecified. It also disentangles emotional expression from speaker identity, so that timbre and emotion can be controlled independently. GPT latent representations are used to improve the intelligibility of highly emotional speech, and a soft instruction mechanism based on fine-tuning Qwen3 makes emotion control more convenient. Experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity.
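The two generation modes described above can be illustrated with a small sketch. This is not the IndexTTS2 API: the function names, the `tokens_per_second` parameter, and its default of 25 are all hypothetical, chosen only to show how a target duration might map to an explicit token count, with `None` standing in for the free generation mode.

```python
from typing import Optional


def tokens_for_duration(seconds: float, tokens_per_second: float = 25.0) -> int:
    """Convert a target duration into a token count.

    Assumes a fixed token rate; the default of 25 tokens per second is a
    placeholder and is not taken from the paper.
    """
    if seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("seconds and tokens_per_second must be positive")
    # Round to the nearest whole token, but always request at least one.
    return max(1, round(seconds * tokens_per_second))


def token_budget(target_seconds: Optional[float],
                 tokens_per_second: float = 25.0) -> Optional[int]:
    """Return an explicit token count, or None for free generation.

    None models the mode where the token count is left unspecified and the
    model decides for itself when to stop.
    """
    if target_seconds is None:
        return None
    return tokens_for_duration(target_seconds, tokens_per_second)


if __name__ == "__main__":
    print(token_budget(2.0))   # duration-controlled mode
    print(token_budget(None))  # free generation mode
```

Under this assumption, duration control reduces to choosing the token count before generation starts, which is why the accuracy of the result depends on how stable the real token rate is.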

Takeaways, Limitations

Takeaways:
Precise speech duration control in an autoregressive TTS model
Independent control of timbre and emotion
Accurate reproduction of timbre and emotion in a zero-shot setting
Improved clarity of emotionally expressive speech through GPT latent representations
More convenient emotion control through a soft instruction mechanism
State-of-the-art performance in word error rate, speaker similarity, and emotional fidelity
Limitations:
The paper does not explicitly state its limitations. Further experiments or performance verification on additional datasets may be required.