
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A2TTS: TTS for Low Resource Indian Languages

Created by
  • Haebom

Authors

Ayush Singh Bhadoriya, Abhishek Nikunj Shinde, Isha Pandey, Ganesh Ramakrishnan

Outline

This paper presents a speaker-conditioned text-to-speech (TTS) system that supports multiple Indian languages and addresses speech generation for unseen speakers. Built on a diffusion-based TTS architecture, a speaker encoder extracts embeddings from short reference audio clips, and these embeddings condition a DDPM decoder for multi-speaker generation. To improve prosody and naturalness, a cross-attention-based duration prediction mechanism leverages the reference audio, enabling more accurate, speaker-consistent timing; this strengthens duration modeling and overall expressiveness while producing speech that closely resembles the target speaker. In addition, classifier-free guidance is employed to improve zero-shot generation, yielding more natural speech for unknown speakers. Language-specific speaker-conditioned models are trained for several Indian languages, including Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi, and Tamil, using the IndicSUPERB dataset.
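To make the pipeline above concrete, here is a minimal, hypothetical sketch of three of the components described: a speaker encoder that maps a short reference clip to an embedding, a cross-attention duration predictor in which phoneme queries attend over reference-audio frames, and classifier-free guidance applied to the denoiser's noise prediction at sampling time. This is not the authors' implementation; every class name, function, and shape is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a short reference waveform to a fixed-size speaker embedding."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, stride=4),
            nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=5, stride=4),
            nn.AdaptiveAvgPool1d(1),  # average over time: one vector per clip
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch, emb_dim)
        return self.net(wav.unsqueeze(1)).squeeze(-1)


class CrossAttnDurationPredictor(nn.Module):
    """Phoneme embeddings (queries) attend over reference-audio frames
    (keys/values), so predicted durations follow the reference timing."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phonemes: torch.Tensor, ref_frames: torch.Tensor) -> torch.Tensor:
        # phonemes: (batch, n_phonemes, dim); ref_frames: (batch, n_frames, dim)
        attended, _ = self.attn(phonemes, ref_frames, ref_frames)
        return self.proj(attended).squeeze(-1)  # (batch, n_phonemes) log-durations


@torch.no_grad()
def cfg_noise_estimate(denoiser, x_t, t, text_ctx, spk_emb, guidance_scale=3.0):
    """Classifier-free guidance: blend the speaker-conditional and
    unconditional noise predictions before the DDPM reverse-diffusion update."""
    eps_cond = denoiser(x_t, t, text_ctx, spk_emb)                      # speaker-conditioned
    eps_uncond = denoiser(x_t, t, text_ctx, torch.zeros_like(spk_emb))  # speaker dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In a full sampler, the blended estimate from cfg_noise_estimate would replace the plain conditional prediction at each DDPM reverse step. Using a zeroed speaker embedding as the unconditional branch assumes the denoiser was trained with speaker-embedding dropout, which is the standard way classifier-free guidance is set up.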

Takeaways, Limitations

Takeaways:
Presents a novel speaker-conditioned TTS system that effectively addresses speech generation for unseen speakers.
Supports multiple Indian languages, ensuring linguistic diversity.
Generates natural, expressive speech using a diffusion model and a cross-attention-based duration prediction mechanism.
Improves zero-shot generation performance via classifier-free guidance.
Limitations:
Performance depends on the IndicSUPERB dataset; generalization to other datasets remains to be verified.
Lacks quantitative analysis of specific performance metrics (e.g., naturalness, intelligibility).
Real-time synthesis capability is not discussed.
Lacks a detailed analysis of the speaker encoder's performance.