Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Created by
  • Haebom

Authors

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Outline

ParsVoice is the largest Persian speech corpus designed for text-to-speech (TTS) applications. We built an automated pipeline that converts audiobook content into TTS-ready data. The pipeline comprises a BERT-based sentence completion detector, a binary search boundary optimization method, and an audio-text quality assessment framework tailored to Persian. We processed 2,000 audiobooks to produce 3,526 hours of clean speech, which we then filtered into a high-quality subset of 1,804 hours suitable for TTS. ParsVoice proved effective for training multi-speaker TTS systems. ParsVoice is publicly available to accelerate the development of Persian speech technology.
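The summary names a binary search boundary optimization method but does not describe it. As a rough illustration only, the sketch below shows how a binary search could locate the earliest cut point in an audio stream at which a target sentence is fully spoken. Everything here is an assumption for illustration: the function name `find_segment_end`, the predicate `covers_sentence`, the tolerance, and the toy word timings are hypothetical and not taken from the paper. The sketch also assumes the predicate is monotone (false before some time, true from then on).

```python
from typing import Callable, List


def find_segment_end(covers_sentence: Callable[[float], bool],
                     lo: float, hi: float, tol: float = 0.05) -> float:
    """Binary-search an end time (seconds) in [lo, hi].

    Returns a time at which covers_sentence is True and which lies
    within `tol` seconds of the earliest such time. Assumes the
    predicate is monotone: False before a threshold, True after it.
    """
    if not covers_sentence(hi):
        raise ValueError("sentence is not complete even at the upper bound")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if covers_sentence(mid):
            hi = mid   # sentence already complete: try an earlier cut
        else:
            lo = mid   # sentence still cut off: move the cut later
    return hi


# Toy stand-in for a real check (e.g. transcribing audio[0:t] and
# comparing it with the target text). Hypothetical word end times:
word_ends: List[float] = [0.4, 0.9, 1.7, 2.3]
words_in_sentence = 3  # the target sentence is the first three words


def covers(t: float) -> bool:
    # True once at least `words_in_sentence` words have finished by time t.
    return sum(1 for end in word_ends if end <= t) >= words_in_sentence


end_time = find_segment_end(covers, lo=0.0, hi=5.0, tol=0.01)
```

In a real pipeline the predicate would be far more expensive than this toy check, which is the usual motivation for a binary search: it needs only a logarithmic number of evaluations over the search window.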

Takeaways, Limitations

Provides the largest high-quality Persian speech dataset.
Offers audio quality comparable to English corpora, with a wide variety of speakers.
Demonstrates the dataset's effectiveness by fine-tuning XTTS on Persian.
Accelerates the development of Persian speech technology by making the dataset publicly available.
The paper does not specify its own limitations.