Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Do Not Mimic My Voice: Speaker Identity Unlearning for Zero-Shot Text-to-Speech

Created by
  • Haebom

Authors

Taesoo Kim, Jinju Kim, Dongchan Kim, Jong Hwan Ko, Gyeong-Moon Park

Outline

This paper addresses the privacy and ethical concerns raised by rapid progress in Zero-Shot Text-to-Speech (ZS-TTS), in particular the risk that a person's voice is cloned without their consent. To mitigate this, the authors propose selectively removing a speaker's identity from a ZS-TTS system. Specifically, they introduce Teacher-Guided Unlearning (TGU), a novel machine unlearning framework that trains a model to forget the voice of a specific speaker while retaining its ability to generate speech in the voices of other speakers. TGU also introduces randomness so that the forgotten speaker's voice cannot be traced. In addition, the paper proposes a new evaluation metric, Speaker-Zero Retrain Forgetting (spk-ZRF), which measures the model's ability to ignore prompts related to the forgotten speaker. Experimental results show that TGU prevents voice cloning of the forgotten speaker while maintaining speech quality for other speakers.
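The core idea of teacher-guided unlearning can be illustrated with a toy sketch. This is not the authors' implementation: `teacher_generate`, `build_training_target`, and the string "voice IDs" are hypothetical stand-ins, and the sketch assumes that a frozen teacher model supplies a randomly voiced target whenever the prompt belongs to a speaker who should be forgotten.

```python
import random

def teacher_generate(text, speaker_prompt=None, rng=None):
    """Hypothetical stand-in for a frozen teacher ZS-TTS model.

    Returns a (text, voice_id) pair. With a prompt it mimics the prompt's
    voice; without one it samples an arbitrary voice.
    """
    rng = rng or random
    if speaker_prompt is not None:
        return (text, speaker_prompt)
    return (text, f"random-voice-{rng.randrange(1000)}")

def build_training_target(text, speaker, forget_set, ground_truth, rng=None):
    """Pick the target the student model would be trained toward (assumed scheme)."""
    if speaker in forget_set:
        # Forget speaker: ignore the prompt's identity and use the teacher's
        # randomly voiced output as the target.
        return teacher_generate(text, speaker_prompt=None, rng=rng)
    # Retained speaker: keep the ordinary ground-truth target.
    return ground_truth

rng = random.Random(0)
forget_set = {"spk_forget"}
print(build_training_target("hello", "spk_forget", forget_set, ("hello", "spk_forget"), rng))
print(build_training_target("hello", "spk_keep", forget_set, ("hello", "spk_keep"), rng))
```

In this toy version, the forget speaker's prompt never determines the target voice, while retained speakers are treated exactly as in normal training.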

Takeaways, Limitations

Takeaways:
Offers a novel approach to addressing privacy and ethical concerns in ZS-TTS systems.
Proposes the Teacher-Guided Unlearning (TGU) framework as an effective method for removing speaker information.
Introduces a new evaluation metric, spk-ZRF, for measuring how well the model removes speaker information.
Helps prevent unwanted voice cloning and improves privacy protection.
Limitations:
Further research is needed to evaluate the generalization performance of the proposed method and its applicability to various ZS-TTS models.
Performance should also be verified with evaluation metrics other than spk-ZRF.
Further analysis is needed of how difficult it is to remove speaker information completely and whether residual information remains.
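For readers unfamiliar with forgetting metrics, the following is a minimal sketch of a score built on Jensen-Shannon divergence between pairs of discrete distributions. It is an assumption-laden illustration, not the paper's definition of spk-ZRF: the function names, the choice of which distributions to compare, and the normalization are all hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def divergence_based_score(dists_a, dists_b):
    """Hypothetical score: 1 minus the mean JS divergence over paired distributions.

    A value near 1 means the paired distributions are nearly identical;
    a value near 0 means they barely overlap.
    """
    pairs = list(zip(dists_a, dists_b))
    return 1.0 - sum(js_divergence(p, q) for p, q in pairs) / len(pairs)

# Toy speaker-posterior distributions over three candidate speakers (made-up numbers).
same = divergence_based_score([[0.2, 0.3, 0.5]], [[0.2, 0.3, 0.5]])
different = divergence_based_score([[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]])
print(same, different)
```

What the two sets of distributions represent (for example, speaker-classifier outputs for different generations) is left open here, since the summary above does not specify how spk-ZRF is computed.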