This paper proposes CapSpeech, a new benchmark for style-captioned text-to-speech (CapTTS), addressing the lack of standardized datasets that has limited further research in this area. CapSpeech is designed to support a range of CapTTS-related tasks, including CapTTS with sound effects (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and TTS for chat agents (AgentTTS). It contains over 10 million machine-annotated audio-caption pairs and nearly 360,000 human-annotated pairs. The authors also present a new dataset, recorded by professional voice actors and audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Extensive experiments with both autoregressive and non-autoregressive models trained on CapSpeech demonstrate high-quality, intelligible speech synthesis across a wide variety of speaking styles.
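To make the audio-caption pair format concrete, here is a minimal, hypothetical sketch of what a single CapTTS training example and its conditioning prompt might look like. The field names, tag tokens, and prompt layout are illustrative assumptions, not the actual CapSpeech schema.

```python
# Hypothetical sketch of a style-captioned TTS (CapTTS) training example.
# Field names and the prompt format are illustrative, not the CapSpeech schema.
from dataclasses import dataclass

@dataclass
class CapTTSExample:
    audio_path: str      # path to the target speech clip
    transcript: str      # text the speaker utters
    style_caption: str   # natural-language description of the speaking style
    task: str            # e.g. "CapTTS-SE", "AccCapTTS", "EmoCapTTS", "AgentTTS"

# A toy pair: the model is conditioned on both the caption and the transcript.
example = CapTTSExample(
    audio_path="clips/000001.wav",
    transcript="The quick brown fox jumps over the lazy dog.",
    style_caption="A young female voice, cheerful and fast-paced, "
                  "with a slight British accent.",
    task="EmoCapTTS",
)

def build_prompt(ex: CapTTSExample) -> str:
    """Combine caption and transcript into one conditioning string,
    as an autoregressive CapTTS model might consume it."""
    return f"[style] {ex.style_caption} [text] {ex.transcript}"

print(build_prompt(example))
```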
Takeaways, Limitations
• Takeaways:
◦ Provides a large-scale dataset (CapSpeech) for a variety of CapTTS-related tasks.
◦ Offers insight into the challenges of developing CapTTS systems.
◦ Achieves high-quality speech synthesis across a variety of speaking styles.
◦ Introduces new datasets for AgentTTS and CapTTS-SE.
• Limitations:
◦ The paper's own limitations are not explicitly presented; the abstract alone provides insufficient detail to identify them.