Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Created by
  • Haebom

Authors

Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak

Outline

This paper addresses Target Speech Extraction (TSE), the task of isolating a specific speaker's speech from a multi-speaker mixture. Existing TSE methods primarily rely on discriminative models, which offer high perceptual quality but tend to introduce artifacts, reduce naturalness, and are sensitive to mismatches between training and testing environments. Generative models, on the other hand, lag in perceptual quality and intelligibility. The authors propose SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. Instead of relying on speaker embeddings, SoloSpeech uses conditional information from the latent space of the cue audio and aligns it with the latent space of the mixture audio, thereby avoiding mismatches. Evaluated on the Libri2Mix dataset, SoloSpeech outperforms existing state-of-the-art methods in both intelligibility and quality, and generalizes well to out-of-domain data and real-world settings.
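To make the cascade order concrete, here is a minimal Python sketch of how four stages (compression, extraction, reconstruction, correction) could be wired together. Only the stage order and the idea of conditioning on the cue's latent rather than a speaker embedding come from the summary above. The class name `CascadedTSE`, the stage signatures, and all `toy_*` functions are hypothetical stand-ins for illustration, not the authors' models or code.

```python
from dataclasses import dataclass
from typing import Callable, List

Wave = List[float]    # toy stand-in for an audio waveform
Latent = List[float]  # toy stand-in for a latent sequence


@dataclass
class CascadedTSE:
    """Hypothetical wiring of a four-stage cascade; real stages would be trained models."""
    compress: Callable[[Wave], Latent]
    extract: Callable[[Latent, Latent], Latent]
    reconstruct: Callable[[Latent], Wave]
    correct: Callable[[Wave], Wave]

    def __call__(self, mixture: Wave, cue: Wave) -> Wave:
        mix_latent = self.compress(mixture)
        # The same compressor encodes the cue, so both latents live in one space.
        cue_latent = self.compress(cue)
        # Extraction is conditioned on the cue latent, not on a speaker embedding.
        target_latent = self.extract(mix_latent, cue_latent)
        estimate = self.reconstruct(target_latent)
        return self.correct(estimate)


# Toy stand-ins (assumptions, chosen only so the sketch runs end to end).
def toy_compress(wave: Wave) -> Latent:
    # Average each pair of samples into one latent frame.
    return [(wave[i] + wave[i + 1]) / 2 for i in range(0, len(wave) - 1, 2)]


def toy_extract(mix_latent: Latent, cue_latent: Latent) -> Latent:
    # Keep mixture frames whose sign agrees with the cue frame, zero the rest.
    return [m if m * c > 0 else 0.0 for m, c in zip(mix_latent, cue_latent)]


def toy_reconstruct(latent: Latent) -> Wave:
    # Repeat each latent frame twice to return to the waveform length.
    return [z for z in latent for _ in range(2)]


def toy_correct(wave: Wave) -> Wave:
    # Clip samples to [-1, 1] as a placeholder for a refinement stage.
    return [max(-1.0, min(1.0, x)) for x in wave]


pipeline = CascadedTSE(toy_compress, toy_extract, toy_reconstruct, toy_correct)
mixture = [0.5, 0.5, -2.0, -2.0, 3.0, 1.0]
cue = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
result = pipeline(mixture, cue)
```

In this sketch, passing the stages as callables keeps the cascade order explicit and lets any single stage be swapped without touching the others. The toy functions carry no signal-processing meaning.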

Takeaways, Limitations

Takeaways:
Presents a novel TSE method that achieves high performance without speaker embeddings.
Mitigates the artifact generation, naturalness degradation, and domain mismatch problems of existing methods.
Achieves new state-of-the-art performance on the Libri2Mix dataset.
Demonstrates excellent generalization performance on out-of-domain data and in real-world environments.
Limitations:
Lack of analysis of the computational cost and complexity of SoloSpeech.
Lack of robustness assessment for various noise environments.
Lack of evaluation on additional datasets beyond the real-world datasets.