Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Posted by
  • Haebom

Author

Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Outline

This paper systematically investigates the role of speech tokenizer design in spoken language models (SLMs) and proposes improvements for effective cross-modal alignment between speech and text and for high-quality speech generation. Adding a speech head and speaker modeling to an LLM-centric SLM, the authors compare coupled, semi-decoupled, and fully decoupled speech tokenizers, and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information-density mismatch between speech and text, they introduce multi-token prediction (MTP), which speeds up decoding by up to 12x and reduces the word error rate from 6.07% to 3.01%. Finally, they propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities, to improve knowledge understanding and speaker consistency.
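The decoding-speed gain from multi-token prediction comes from emitting several speech tokens per forward pass instead of one. A minimal sketch of that idea, using a toy stand-in model (`toy_model` and the parameter `k` are illustrative assumptions, not the paper's implementation):

```python
def toy_model(context, k):
    """Stand-in for an SLM speech head that predicts the next k speech
    tokens from the current context in a single forward pass (dummy
    deterministic output here)."""
    start = len(context)
    return [start + i for i in range(k)]

def decode(total_tokens, k):
    """Greedy decoding loop. With k=1 this is ordinary autoregressive
    decoding; with k>1 each step emits k tokens, cutting the number of
    forward passes (and hence latency) by roughly a factor of k."""
    tokens, steps = [], 0
    while len(tokens) < total_tokens:
        tokens.extend(toy_model(tokens, k))
        steps += 1
    return tokens[:total_tokens], steps

_, ar_steps = decode(48, k=1)    # one token per forward pass
_, mtp_steps = decode(48, k=12)  # 12 tokens per pass, mirroring the reported 12x speedup
print(ar_steps, mtp_steps)  # → 48 4
```

This also illustrates why MTP suits the speech-text density mismatch: speech token sequences are far longer than their text counterparts, so collapsing many speech tokens into each decoding step narrows that gap.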

Takeaways, Limitations

Takeaways:
A fully decoupled speech tokenizer is shown to improve speech-text alignment and synthesis quality in SLMs.
Multi-token prediction (MTP) significantly improves SLM decoding speed (up to 12x) and reduces the word error rate from 6.07% to 3.01%.
The speaker-aware generation paradigm and the RoleTriviaQA benchmark improve knowledge understanding and speaker consistency.
Limitations:
Further validation of the scale and diversity of the RoleTriviaQA benchmark is needed.
The proposed method's generalization to other SLM architectures and datasets remains to be evaluated.
Further analysis of the computational complexity and memory usage of MTP is needed.