This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Created by
Haebom
Author
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li
Outline
Autoregressive speech-token generation models produce diverse, natural-sounding speech, but their lack of controllability leads to problems such as hallucinations and unwanted vocalizations. Koel-TTS is an improved encoder-decoder Transformer TTS model that addresses these issues by applying preference-alignment techniques driven by automatic speech recognition (ASR) and speaker verification (SV) models. It further improves adherence to the input transcript and the reference speaker audio through classifier-free guidance. Experimental results show that these optimizations significantly improve target-speaker similarity, intelligibility, and naturalness of the synthesized speech, outperforming existing state-of-the-art TTS models despite training on a relatively small dataset.
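The paper's exact classifier-free guidance formulation for speech tokens is not reproduced in this summary. A common form of CFG interpolates the model's conditional and unconditional logits at each decoding step; the sketch below illustrates that idea (function name and scale convention are illustrative, not taken from the paper):

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray,
               uncond_logits: np.ndarray,
               scale: float) -> np.ndarray:
    """Classifier-free guidance: push logits away from the unconditional
    prediction toward the conditional one. scale=1.0 recovers the purely
    conditional logits; scale>1.0 strengthens conditioning on the
    transcript and reference audio."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Toy example over a 3-token vocabulary at one decoding step.
cond = np.array([2.0, 0.5, -1.0])    # logits with text/speaker conditioning
uncond = np.array([0.5, 0.5, 0.5])   # logits with conditioning dropped
guided = cfg_logits(cond, uncond, scale=1.5)
```

Training with conditioning randomly dropped (so the model can produce the unconditional branch) is the usual prerequisite for applying CFG at inference time.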
Takeaways, Limitations
•
Takeaways:
◦
Preference alignment guided by automatic speech recognition and speaker verification models, combined with classifier-free guidance, significantly improves the controllability of the TTS model and the quality of the synthesized speech.
◦
The model demonstrates data efficiency, achieving state-of-the-art performance despite being trained on a relatively small dataset.
◦
Target-speaker similarity, intelligibility, and naturalness all improve.
•
Limitations:
◦
The size of the training dataset is not stated precisely, making it difficult to assess how performance would compare against models trained on larger datasets.
◦
A more detailed analysis is needed of how large the "small dataset" mentioned in the paper actually is and how it differs from the data used by other models.
◦
There is no analysis of potential bias toward specific languages or speakers.
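The preference-alignment step described above scores candidate generations with ASR and speaker verification models. The paper's exact reward and pairing scheme are not given in this summary, so the following is only a hypothetical sketch of how candidates might be ranked into chosen/rejected preference pairs (helper names and weights are illustrative):

```python
def preference_score(wer: float, speaker_sim: float,
                     wer_weight: float = 1.0,
                     sim_weight: float = 1.0) -> float:
    """Composite reward: a lower ASR word error rate (intelligibility)
    and a higher speaker-verification similarity (speaker fidelity)
    both raise the score. Weights are illustrative."""
    return sim_weight * speaker_sim - wer_weight * wer

def rank_candidates(candidates: list) -> list:
    """Sort candidate generations (dicts with 'wer' and 'speaker_sim'
    fields) best-first; the top and bottom entries could then serve as
    the chosen/rejected pair for preference optimization."""
    return sorted(
        candidates,
        key=lambda c: preference_score(c["wer"], c["speaker_sim"]),
        reverse=True,
    )

# Toy example: candidate "b" is both more intelligible and more similar.
ranked = rank_candidates([
    {"id": "a", "wer": 0.30, "speaker_sim": 0.80},
    {"id": "b", "wer": 0.10, "speaker_sim": 0.90},
])
```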