Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Vocoder-Projected Feature Discriminator

Created by
  • Haebom

Authors

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

Outline

This paper addresses a limitation of existing text-to-speech (TTS) and voice conversion (VC) approaches that use acoustic features, such as mel spectrograms, as targets when the end goal is a high-quality speech signal. These approaches use a vocoder to convert the acoustic features into a waveform and apply adversarial training in the time domain, but upsampling to the waveform incurs significant time and memory overhead. To address this, the authors propose a vocoder-projected feature discriminator (VPFD), which performs adversarial training on vocoder features instead. In diffusion-based VC distillation experiments, a pre-trained, fixed vocoder feature extractor with a single upsampling step lets VPFD achieve VC performance comparable to that of a time-domain waveform discriminator while reducing training time and memory consumption by 9.6x and 11.4x, respectively.
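To make the overhead argument concrete, the following is a minimal sketch of how the time-axis length grows through a vocoder's upsampling stages. The upsampling factors and segment length are assumptions chosen for illustration (a HiFi-GAN-style configuration), not values taken from the paper, and the function name is hypothetical.

```python
from itertools import accumulate
from operator import mul

# Assumed, illustrative values (not taken from the paper):
# a HiFi-GAN-style vocoder with four upsampling stages.
UPSAMPLE_FACTORS = [8, 8, 2, 2]   # product = 256 samples per mel frame
NUM_MEL_FRAMES = 100              # length of one training segment in frames


def lengths_after_each_stage(num_frames, factors):
    """Return the time-axis length after each successive upsampling stage."""
    cumulative = accumulate(factors, mul)
    return [num_frames * c for c in cumulative]


stage_lengths = lengths_after_each_stage(NUM_MEL_FRAMES, UPSAMPLE_FACTORS)
waveform_length = stage_lengths[-1]   # input length for a time-domain discriminator
one_step_length = stage_lengths[0]    # input length after a single upsampling step

print("lengths per stage:", stage_lengths)
print("waveform / one-step length ratio:", waveform_length // one_step_length)
```

Under these assumed factors, a discriminator placed after the first upsampling stage sees a sequence 32 times shorter than the full waveform. This ratio covers only the time axis; it ignores channel counts and discriminator architecture, so it is not expected to reproduce the 9.6x and 11.4x savings reported in the paper.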

Takeaways, Limitations

Takeaways:
Adversarial training on vocoder features can substantially reduce training time and memory consumption (by 9.6x and 11.4x, respectively, in the paper's VC distillation experiments).
A pre-trained vocoder can be reused as a fixed feature extractor, which suggests a route to more efficient training of speech generation models.
The experiments show that the vocoder-projected feature discriminator (VPFD) achieves VC performance comparable to that of a time-domain waveform discriminator.
Limitations:
The reported results come from diffusion-based VC distillation experiments, so the gains may be specific to that setting.
Further research is needed on how well the method generalizes to other speech synthesis and voice conversion models and datasets.
The results may depend on the quality of the pre-trained vocoder.