Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Interpolating Speaker Identities in Embedding Space for Data Expansion

Created by
  • Haebom

Authors

Tianchi Liu, Ruijie Tao, Qiongqiong Wang, Yidi Jiang, Hardik B. Sailor, Ke Zhang, Jingru Lin, Haizhou Li

Outline

Deep learning-based speaker verification (also called speaker authentication) systems depend heavily on access to large, diverse speaker datasets. To address this limitation, the paper proposes INSIDE (Interpolating Speaker Identities in Embedding Space), a data augmentation method that synthesizes new speaker identities by interpolating between existing speaker embeddings. INSIDE selects pairs of nearby speaker embeddings from a pre-trained speaker embedding space and computes an intermediate embedding for each pair using spherical linear interpolation. The interpolated embeddings are fed into a speech synthesis system to generate the corresponding speech waveforms, and the resulting data is combined with the original dataset to train downstream models. In the reported experiments, models trained with INSIDE-augmented data outperform models trained solely on real data, with relative gains of 3.06% to 5.24% on speaker verification and a 13.44% relative gain on gender classification. INSIDE is also compatible with other augmentation techniques, making it a flexible and scalable addition to existing training pipelines.
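The pair selection and interpolation steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `slerp` and `nearest_pairs`, the use of cosine similarity to decide which speakers are "nearby", the midpoint weight `t = 0.5`, and the random toy embeddings are all assumptions made here, since this summary does not specify those details. The speech synthesis step is omitted.

```python
import numpy as np


def slerp(a, b, t):
    """Spherical linear interpolation between a and b at fraction t in [0, 1]."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Angle between the unit vectors; clip guards against rounding error.
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-8:
        # Nearly identical directions: the arc degenerates to a point.
        return a.copy()
    # Note: exactly opposite vectors (omega = pi) are not handled; nearby
    # pairs, as used here, should never be in that configuration.
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def nearest_pairs(embeddings):
    """Pair each speaker with its most cosine-similar other speaker (assumed criterion)."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    nearest = sim.argmax(axis=1)
    # Store each pair as (smaller index, larger index) and drop duplicates.
    return sorted({(min(i, int(j)), max(i, int(j))) for i, j in enumerate(nearest)})


# Toy stand-in for a pre-trained speaker embedding space (random, not real data).
rng = np.random.default_rng(0)
speaker_embeddings = rng.normal(size=(6, 16))

pairs = nearest_pairs(speaker_embeddings)
# One synthetic identity per pair, taken at the midpoint of the arc (t = 0.5 is an assumption).
new_identities = np.stack(
    [slerp(speaker_embeddings[i], speaker_embeddings[j], 0.5) for i, j in pairs]
)
```

Unlike plain linear interpolation, the spherical variant keeps every interpolated vector on the unit hypersphere, which matters when the embedding space is compared by cosine similarity. In a full pipeline each row of `new_identities` would then condition a speech synthesis model to produce audio for the new speaker.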

Takeaways, Limitations

Takeaways:
The paper presents a data augmentation technique that improves the performance of deep learning-based speaker verification (authentication) and related tasks even when real speaker data is limited.
The technique also carries over to other tasks, such as gender classification, in addition to speaker verification.
The method is flexible and scalable, and can be easily integrated into existing training pipelines.
Limitations:
Interpolated speaker embeddings may not perfectly reflect the characteristics of real speakers.
The quality of the generated data may be affected by the performance of the speech synthesis system.
Additional considerations may be needed regarding privacy issues (such as the potential for personal information to be leaked during data synthesis).