Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

TTS-1 Technical Report

Created by
  • Haebom

Author

Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu, Dmytro Semernia, Evgenii Shingarev, Vikram Sivaraja, Peter Skirko, Rinat Takhautdinov, Robert Villahermosa, Jean Wang

Outline

Inworld TTS-1 is a set of two Transformer-based autoregressive text-to-speech (TTS) models. TTS-1-Max, the larger model at 8.8 billion parameters, targets maximum quality and expressiveness for demanding applications. TTS-1, at 1.6 billion parameters, is the smaller, more efficient model, aimed at real-time speech synthesis and on-device use cases. By scaling training-time compute and applying a sequential process of pre-training, fine-tuning, and RL alignment to the Speech Language Model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, showing exceptional quality in reproducing a speaker's voice through in-context learning alone. Both models generate high-resolution 48 kHz speech with low latency, support 11 languages, and, via audio markups, offer fine-grained emotional control and non-verbal vocalizations. The training and modeling code is open-sourced under the MIT License.
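The core generation process described above, where a SpeechLM autoregressively extends a text prompt with discrete speech tokens that are later decoded into a 48 kHz waveform, can be sketched as follows. This is a minimal illustration only: the function and token names (`generate_speech_tokens`, `EOS`, the toy model) are hypothetical stand-ins, not Inworld's actual API.

```python
# Minimal sketch of an autoregressive SpeechLM decoding loop.
# All names here are hypothetical illustrations, not the TTS-1 codebase.
from typing import Callable, List

EOS = -1  # hypothetical end-of-speech token id


def generate_speech_tokens(prompt_tokens: List[int],
                           next_token: Callable[[List[int]], int],
                           max_tokens: int = 1000) -> List[int]:
    """Autoregressively extend the context with speech tokens until EOS.

    `next_token` stands in for the SpeechLM: given the full context
    (text prompt + speech tokens generated so far), it predicts the
    next speech token. The returned token ids would then be decoded
    by an audio codec into the final 48 kHz waveform.
    """
    context = list(prompt_tokens)
    speech: List[int] = []
    for _ in range(max_tokens):
        tok = next_token(context)
        if tok == EOS:
            break
        speech.append(tok)
        context.append(tok)  # each new token conditions the next step
    return speech


def make_toy_model(n: int, prompt_len: int) -> Callable[[List[int]], int]:
    """Toy stand-in model: emits tokens 0..n-1, then EOS."""
    def next_token(context: List[int]) -> int:
        produced = len(context) - prompt_len
        return produced if produced < n else EOS
    return next_token


if __name__ == "__main__":
    prompt = [101, 102, 103]  # pretend text-token prompt
    tokens = generate_speech_tokens(prompt, make_toy_model(5, len(prompt)))
    print(tokens)  # [0, 1, 2, 3, 4]
```

The voice-cloning-by-in-context-learning behavior mentioned above would correspond to prepending reference-audio tokens to `prompt_tokens`, so the model conditions on them without any weight updates.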

Takeaways, Limitations

Takeaways:
Two model sizes (8.8 billion and 1.6 billion parameters) provide flexibility for different use cases.
The smaller TTS-1 is efficient enough for real-time speech synthesis and on-device use.
Support for 11 languages, fine-grained emotional control, and non-verbal vocalizations suits a wide range of applications.
High-resolution (48 kHz) output enables high-quality speech generation.
Open-sourcing the training and modeling code accelerates research and development.
Strong performance in reproducing a speaker's voice through in-context learning alone.
Limitations:
The paper does not explicitly discuss its own limitations or failure cases.
Detailed analysis of performance and efficiency on specific hardware is still needed.
There is little discussion of model bias or ethical concerns.
The specific list of supported languages is not explicitly provided.