Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Created by
  • Haebom

Authors

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen

Outline

This paper proposes EmoVoice, a novel TTS model capable of controlling emotional expression. EmoVoice leverages a large language model (LLM) to enable free and fine-grained natural-language emotion control. Furthermore, inspired by Chain of Thought (CoT) and Chain of Modality (CoM) techniques, it enhances content consistency through a phoneme-boosted variant that outputs phoneme tokens and audio tokens in parallel. The authors also introduce EmoVoice-DB, a high-quality 40-hour English emotional speech dataset containing expressive speech, detailed emotion labels, and natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using the authors' in-house data. In addition, the paper investigates the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and evaluates emotional speech using GPT-4o-audio and Gemini, two state-of-the-art multimodal LLMs. The dataset, code, checkpoints, and demo samples are available on GitHub.
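The parallel phoneme/audio token output described above can be pictured as a decoder with two output heads over one shared hidden state. The sketch below is a hypothetical toy illustration (not the paper's actual code); all names, dimensions, and vocabulary sizes are assumptions made up for the example.

```python
import numpy as np

# Hypothetical sketch of the "phoneme-boosted" idea: one decoder step
# produces BOTH a phoneme token and an audio token from the same hidden
# state, so the phoneme stream can anchor the content of the speech stream.
# Sizes are toy values, not those of EmoVoice.

rng = np.random.default_rng(0)

HIDDEN = 16          # assumed decoder hidden size
PHONEME_VOCAB = 50   # assumed phoneme inventory size
AUDIO_VOCAB = 1024   # assumed audio codec codebook size

# Stand-in for the LLM backbone's hidden state at one decoding step.
h = rng.normal(size=(HIDDEN,))

# Two independent projection heads applied to the same state in parallel.
W_phoneme = rng.normal(size=(HIDDEN, PHONEME_VOCAB))
W_audio = rng.normal(size=(HIDDEN, AUDIO_VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

phoneme_probs = softmax(h @ W_phoneme)
audio_probs = softmax(h @ W_audio)

# Greedy pick of one token per stream for this step.
phoneme_token = int(phoneme_probs.argmax())
audio_token = int(audio_probs.argmax())
print(phoneme_token, audio_token)
```

In a real model the two heads would be trained jointly and decoded autoregressively; the point here is only that both token streams are emitted at each step rather than the phonemes being generated first and discarded.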

Takeaways, Limitations

Takeaways:
EmoVoice, a TTS model enabling free and fine-grained natural-language emotion control via an LLM, is proposed.
Content consistency is improved through the phoneme-boosted variant design.
EmoVoice-DB, a high-quality English emotional speech dataset, is released.
State-of-the-art performance is achieved using synthetic training data alone.
The reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences are studied.
Emotional speech is evaluated using state-of-the-art multimodal LLMs.
Research reproducibility is supported through open release of the code, dataset, checkpoints, and demo samples.
Limitations:
EmoVoice-DB is English-centric, which may limit generalization to other languages.
Because the model was trained only on synthetic data, comparative studies against training on real speech data are needed.
Further research is needed on the limitations of existing emotion evaluation metrics and on developing more refined evaluation methodologies.
The reliability of evaluations produced by multimodal LLMs such as GPT-4o-audio and Gemini requires verification.