Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Created by
  • Haebom

Authors

Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi

Outline

FMSD-TTS is a few-shot, multi-speaker, multi-dialect text-to-speech framework. It addresses a low-resource problem: the three major Tibetan dialects (Ü-Tsang, Amdo, and Kham) lack parallel speech corpora. The framework synthesizes parallel dialect speech from limited reference audio and explicit dialect labels. A speaker-dialect fusion module and a dialect-specific dynamic routing network (DSDR-Net) capture subtle acoustic and linguistic variations between dialects while preserving speaker identity. Objective and subjective evaluations show significant improvements in dialect expressivity and speaker similarity over baseline models. The quality and usability of the synthesized speech are further verified through a challenging speech-to-speech dialect conversion task. Key contributions are a few-shot Tibetan multi-dialect speech synthesis system, the release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and an open-source evaluation tool for standardized evaluation of speaker similarity, dialect consistency, and audio quality.
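To make the speaker-similarity metric mentioned above more concrete, here is a minimal sketch of one common way such a score is computed: cosine similarity between fixed-length speaker embeddings of a reference clip and of synthesized clips. This is an assumption for illustration only; the summary does not say how the FMSD-TTS evaluation tool measures speaker similarity. The function names (cosine_similarity, mean_speaker_similarity) and the toy embedding values are hypothetical, and in practice the embeddings would come from a speaker-verification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(reference, synthesized):
    """Average similarity of each synthesized embedding to the reference."""
    if not synthesized:
        raise ValueError("need at least one synthesized embedding")
    total = sum(cosine_similarity(reference, s) for s in synthesized)
    return total / len(synthesized)


# Hypothetical toy embeddings: one reference speaker and one synthesized
# utterance per dialect. Real embeddings would have hundreds of dimensions.
reference_embedding = [0.9, 0.1, 0.3]
synthesized_by_dialect = {
    "U-Tsang": [0.8, 0.2, 0.3],
    "Amdo": [0.9, 0.0, 0.4],
    "Kham": [0.7, 0.1, 0.2],
}

for dialect, embedding in synthesized_by_dialect.items():
    score = cosine_similarity(reference_embedding, embedding)
    print(f"{dialect}: {score:.3f}")

overall = mean_speaker_similarity(
    reference_embedding, list(synthesized_by_dialect.values())
)
print(f"mean: {overall:.3f}")
```

Under this assumed metric, a score close to 1 means the synthesized voice stays close to the reference speaker across dialects, which is the property the summary describes as preserving speaker identity.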

Takeaways, Limitations

Takeaways:
Offers an effective solution to multi-dialect speech synthesis for Tibetan, a low-resource language.
Supports future research by releasing a large-scale synthetic Tibetan speech corpus generated with FMSD-TTS.
Advances the standardization of multi-dialect speech synthesis research by providing open-source evaluation tools.
Achieves high performance from limited data through few-shot learning.
Limitations:
The available information does not detail specific performance limitations of FMSD-TTS.
Further research is needed to determine whether the approach generalizes to other low-resource languages.
A more in-depth analysis of the naturalness of the synthesized speech is needed.