
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis

Created by
  • Haebom

Author

Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen

Outline

As the risk of fake information and identity theft grows, distinguishing real from synthetic speech becomes increasingly important, yet existing synthetic speech analysis datasets remain limited. To overcome these limitations, we propose Speech-Forensics, a dataset that extensively covers real, synthetic, and partially faked speech samples, where the partially faked samples contain multiple segments synthesized by various high-quality algorithms. We also propose the TEmporal Speech Localization Network (TEST), which simultaneously performs authenticity verification, localization of multiple fake segments, and recognition of the synthesis algorithm without complex post-processing. TEST integrates an LSTM and a Transformer to extract robust temporal speech representations and estimates synthetic segments using dense prediction on multi-scale pyramid features (a minimal sketch of this pipeline appears below). The model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level, and an EER of 1.07% and an F1-score of 92.19% at the segment level, highlighting its robust capability for comprehensive synthetic speech analysis.
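The summary above describes the pipeline only at a high level, so here is a minimal PyTorch sketch of a TEST-style model. The feature dimension (80-dim filterbanks), hidden sizes, the pyramid built from strided 1-D convolutions, and the per-frame head emitting one authenticity logit plus per-algorithm logits are all illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """LSTM + Transformer stack producing frame-level speech representations."""
    def __init__(self, feat_dim=80, hidden=256, heads=4, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=heads,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, x):             # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)           # (batch, time, 2 * hidden)
        return self.transformer(h)

class PyramidHead(nn.Module):
    """Dense per-frame prediction over a multi-scale temporal pyramid."""
    def __init__(self, dim=512, levels=3, num_algorithms=5):
        super().__init__()
        self.downsample = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(levels - 1))
        # Per frame: 1 authenticity logit + one logit per synthesis algorithm.
        self.head = nn.Conv1d(dim, 1 + num_algorithms, kernel_size=1)

    def forward(self, h):             # h: (batch, time, dim)
        x = h.transpose(1, 2)         # (batch, dim, time)
        outputs = [self.head(x)]
        for down in self.downsample:  # halve the temporal resolution per level
            x = down(x)
            outputs.append(self.head(x))
        return outputs                # one dense prediction map per scale

# Usage: a batch of 2-second utterances as 80-dim filterbank frames at 100 fps.
feats = torch.randn(4, 200, 80)
preds = PyramidHead()(TemporalEncoder()(feats))
print([tuple(p.shape) for p in preds])  # [(4, 6, 200), (4, 6, 100), (4, 6, 50)]
```

The intuition behind the pyramid is that coarse scales localize long fake segments cheaply while fine scales resolve short ones, so dense prediction at every level covers both without post-processing.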

Takeaways, Limitations

Takeaways:
• Presents Speech-Forensics, a new dataset containing real, synthetic, and partially faked speech generated by various high-quality synthesis algorithms.
• Proposes TEST, an efficient network that simultaneously performs authenticity verification, fake-segment localization, and synthesis-algorithm recognition.
• Achieves high accuracy (utterance-level mAP 83.55% and EER 5.25%; segment-level EER 1.07% and F1-score 92.19%), a significant advance in synthetic speech analysis (see the EER sketch after the Limitations list).
• Provides a useful foundation for future synthetic speech analysis research and practical applications.
Limitations:
• Lacks specific information about the dataset's size and diversity (total size, types and proportions of synthesis algorithms, etc.).
• The model's generalization performance needs further verification (robustness to varied recording environments, noise, etc.).
• Lacks performance evaluation on complex real-world speech (e.g., background noise, overlapping speakers).
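The headline results above are reported as EER (equal error rate), the operating point where the false-alarm and miss rates coincide. Below is a minimal NumPy sketch of how EER can be computed from detection scores; the score convention (higher = more likely synthetic) and the toy data are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: error rate at the threshold where false-alarm and miss rates meet."""
    order = np.argsort(scores)[::-1]      # sort detection scores descending
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()                  # number of synthetic samples
    n_neg = len(labels) - n_pos           # number of genuine samples
    tp = np.cumsum(labels)                # fakes flagged at each threshold
    fp = np.cumsum(1 - labels)            # genuine samples wrongly flagged
    fnr = 1 - tp / n_pos                  # miss rate (fakes not flagged)
    fpr = fp / n_neg                      # false-alarm rate
    i = np.argmin(np.abs(fnr - fpr))      # threshold where the two rates cross
    return (fnr[i] + fpr[i]) / 2

# Toy example: 1 = synthetic, 0 = genuine; higher score = more likely synthetic.
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(f"EER = {equal_error_rate(scores, labels):.2%}")  # perfectly separable -> 0.00%
```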