Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Adaptive Duration Model for Text Speech Alignment

Created by
  • Haebom

Author

Junjie Cao

Outline

This paper addresses text-speech alignment, a crucial component of neural text-to-speech (TTS) models. Autoregressive TTS models typically learn alignment online through an attention mechanism, whereas non-autoregressive end-to-end TTS models rely on durations extracted from external sources. The paper proposes a novel duration prediction framework that produces plausible phoneme-level duration distributions from a given text. Experimental results show that the proposed duration model is more accurate and more adaptive to conditioning than existing baselines: it significantly improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to mismatches between the prompt and the input audio.
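The paper itself does not disclose its architecture in this summary, so the following is only a minimal sketch of what a distribution-based phoneme duration predictor could look like: it outputs per-phoneme parameters of a Gaussian over log-durations, optionally conditioned on a prompt or speaker embedding. All module names, dimensions, and the choice of distribution are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a phoneme-level duration distribution predictor.
# Not the paper's architecture; names and dimensions are assumptions.
import torch
import torch.nn as nn

class DurationDistributionPredictor(nn.Module):
    def __init__(self, num_phonemes: int, hidden_dim: int = 256, cond_dim: int = 128):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(num_phonemes, hidden_dim)
        # A conditioning vector (e.g., speaker or prompt embedding) is projected
        # and added to every phoneme representation.
        self.cond_proj = nn.Linear(cond_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Predict mean and log-variance of the log-duration for each phoneme.
        self.out = nn.Linear(2 * hidden_dim, 2)

    def forward(self, phoneme_ids: torch.Tensor, cond: torch.Tensor):
        # phoneme_ids: (batch, seq_len), cond: (batch, cond_dim)
        x = self.phoneme_embedding(phoneme_ids) + self.cond_proj(cond).unsqueeze(1)
        h, _ = self.encoder(x)
        mu, log_var = self.out(h).chunk(2, dim=-1)
        # Per-phoneme parameters of a Gaussian over log-durations.
        return mu.squeeze(-1), log_var.squeeze(-1)

def duration_nll(mu, log_var, target_log_dur, mask):
    # Gaussian negative log-likelihood on log-durations, masked over padding.
    nll = 0.5 * (log_var + (target_log_dur - mu) ** 2 / log_var.exp())
    return (nll * mask).sum() / mask.sum()
```

At inference one could take the predicted mean as a point estimate or sample from the distribution; the distributional output is what would let such a model adapt its durations to different conditioning prompts rather than committing to a single fixed duration per phoneme.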

Takeaways, Limitations

Takeaways:
The proposed duration prediction framework predicts phoneme-level durations more accurately and adapts better to conditioning than existing baseline models.
It improves phoneme-level alignment accuracy and makes zero-shot TTS models more robust to prompt-input mismatches.
It can help improve the performance of non-autoregressive end-to-end TTS models.
Limitations:
The generalization performance of the proposed model requires further evaluation.
Experimental results on diverse languages and speech datasets are not presented.
A more comprehensive comparison with other duration prediction models is needed.