Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech

Created by
  • Haebom

Authors

Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar, Tiantian Feng, Dongchao Yang, Junhyeok Lee, Thomas Thebaud, Laureano Moro-Velazquez, Jesus Villalba, Zengyi Qin, Shrikanth Narayanan, Mounya Elhilali, Najim Dehak

Outline

This paper introduces CapSpeech, a new benchmark designed to advance style-captioned text-to-speech (CapTTS) and to address the lack of standardized datasets that has limited further research. CapSpeech supports a range of CapTTS-related tasks, including CapTTS-SE, AccCapTTS, EmoCapTTS, and AgentTTS. It contains over 10 million machine-annotated audio-caption pairs and nearly 360,000 human-annotated audio-caption pairs, and it additionally provides new datasets recorded by professional voice actors and audio engineers for the AgentTTS and CapTTS-SE tasks. Extensive experiments with autoregressive and non-autoregressive models trained on CapSpeech demonstrate high-quality, intelligible speech synthesis across a wide variety of speaking styles.
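As a rough illustration of how such audio-caption pairs might be consumed, the sketch below streams a few examples with the Hugging Face datasets library. The dataset ID "OpenSound/CapSpeech" and the field names "text" and "caption" are assumptions made for illustration, not a schema confirmed by the paper.

# Minimal sketch: streaming style-captioned examples with the Hugging Face
# `datasets` library. The dataset ID and field names below are assumed
# placeholders, not the paper's published schema.
from datasets import load_dataset

# Streaming avoids downloading the full corpus (10M+ pairs) up front.
ds = load_dataset("OpenSound/CapSpeech", split="train", streaming=True)

for example in ds.take(3):
    # Each CapTTS example pairs a transcript with a free-form style caption.
    print("text:   ", example.get("text"))
    print("caption:", example.get("caption"))

In a CapTTS setup, the caption (e.g., "a cheerful young female voice, fast pace") conditions the synthesis of the transcript text.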

Takeaways, Limitations

Takeaways:
Provides a large-scale dataset (CapSpeech) covering a variety of CapTTS-related tasks.
Offers insight into the challenges of developing CapTTS systems.
Demonstrates high-quality speech synthesis across a variety of voice styles.
Introduces new datasets for AgentTTS and CapTTS-SE.
Limitations:
The paper itself does not explicitly state its limitations; the abstract alone does not provide enough detail to identify them.