Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

QR-VC: Leveraging Quantization Residuals for Linear Disentanglement in Zero-Shot Voice Conversion

Created by
  • Haebom

Author

Youngjun Sim, Jinsung Yoon, Wooyeol Jeong, Young-Joo Suh

Outline

This paper presents QR-VC, a zero-shot voice conversion method that converts the speaker identity of input speech to that of a target speaker using only a single reference utterance, without additional training. Prior work has focused on extracting high-quality content representations by removing speaker information from self-supervised learning (SSL) features via K-means quantization. However, this quantization often discards fine-grained phonetic and prosodic information, degrading intelligibility and prosody preservation. The proposed method exploits the quantization residuals, taking their temporal characteristics into account, to effectively separate speaker information from phonetic and prosodic information. Using only K-means quantization and a linear projection, it achieves simple yet effective disentanglement without complex architectures or explicit supervision, and enables high-quality voice conversion trained with a reconstruction loss alone. Experimental results show that the proposed model outperforms existing methods on both subjective and objective metrics, improving intelligibility, speaker similarity, and prosody preservation.
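
To make the core idea concrete, below is a minimal, illustrative sketch, not the authors' code: the feature shapes, the random projection matrices, and the target-speaker handling are all assumptions. It shows how K-means quantization residuals of SSL features could be split by plain linear projections into a speaker part and a fine content/prosody part.

```python
# Minimal sketch (assumed names/shapes, not the QR-VC implementation):
# quantize SSL content features with K-means, treat the residual as the carrier
# of fine-grained phonetic/prosodic and speaker detail, and split it with
# simple linear projections.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: frame-level SSL features with T frames and D dimensions.
T, D, K = 200, 768, 128
ssl_features = np.random.randn(T, D).astype(np.float32)

# 1) K-means quantization gives coarse, speaker-independent content.
kmeans = KMeans(n_clusters=K, n_init="auto", random_state=0).fit(ssl_features)
codebook = kmeans.cluster_centers_                     # (K, D)
quantized = codebook[kmeans.predict(ssl_features)]     # (T, D) coarse content

# 2) Quantization residual: everything K-means discards
#    (fine phonetics, prosody, speaker information).
residual = ssl_features - quantized                    # (T, D)

# 3) Linear disentanglement (illustrative): project the residual into a
#    speaker part and a content-detail part. In practice these projections
#    would be learned; random matrices stand in for them here.
W_spk = np.random.randn(D, D).astype(np.float32) / np.sqrt(D)
W_cnt = np.random.randn(D, D).astype(np.float32) / np.sqrt(D)
speaker_part = residual @ W_spk                        # (T, D)
content_detail = residual @ W_cnt                      # (T, D)

# 4) Conversion: keep the source's coarse content and content detail,
#    swap in a speaker component derived from a target reference utterance
#    (here a placeholder time-averaged vector).
target_speaker_part = speaker_part.mean(axis=0, keepdims=True)
converted_features = quantized + content_detail + target_speaker_part
```

In the actual method the projections would be trained jointly with the rest of the model using only a reconstruction loss; the sketch above only illustrates where the residual and the linear split sit in the pipeline.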

Takeaways, Limitations

Takeaways:
• Presents a novel method that improves voice conversion performance by utilizing K-means quantization residuals.
• Achieves high-quality zero-shot voice conversion without complex architectures or explicit supervision.
• Improves intelligibility, speaker similarity, and prosody preservation.
• Demonstrates the effectiveness of the Linear Disentangler module.
Limitations:
• Further research is needed on the generalization performance of the proposed method.
• Evaluation on a wider range of languages and speech datasets is needed.
• Performance may degrade due to the inherent limitations of K-means quantization.