Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

XY-Tokenizer: Mitigating the Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs

Created by
  • Haebom

Authors

Yitian Gong, Luozhijie Jin, Ruifan Deng, Dong Zhang, Xin Zhang, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

Outline

In this paper, we analyze the limitations of speech codecs that act as a bridge between speech signals and large language models, and propose XY-Tokenizer, a novel codec that captures both semantic and acoustic information. XY-Tokenizer mitigates the conflict between semantic and acoustic capabilities through multi-stage, multi-task learning. Experiments show that XY-Tokenizer performs comparably to state-of-the-art codecs operating at similar bitrates on both semantic and acoustic tasks. In particular, it achieves stronger text alignment than distillation-based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between the reconstructed and original audio; this reconstruction quality is comparable to BigCodec, a state-of-the-art acoustic-only codec, which scores 0.84. The code and models are available at https://github.com/gyt1145028706/XY-Tokenizer.
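To make the multi-stage, multi-task idea concrete, here is a minimal sketch of the kind of combined objective such a codec could optimize: a weighted sum of an acoustic reconstruction term and a semantic alignment term against a text-aligned teacher. The loss terms, weights, and function names below are illustrative assumptions, not the authors' actual recipe.

```python
# Minimal sketch (assumptions, not the authors' implementation) of a
# semantic + acoustic multi-task objective for a speech codec.
import torch
import torch.nn.functional as F

def multitask_loss(
    recon_audio: torch.Tensor,       # decoder output waveform
    target_audio: torch.Tensor,      # original waveform
    codec_semantic: torch.Tensor,    # codec's semantic features
    teacher_semantic: torch.Tensor,  # features from a text-aligned teacher (assumed)
    w_acoustic: float = 1.0,         # assumed weights; a staged training
    w_semantic: float = 1.0,         # schedule could re-balance these per stage
) -> torch.Tensor:
    # Acoustic term: waveform reconstruction (L1 here for simplicity;
    # real neural codecs typically add spectral and adversarial losses).
    acoustic = F.l1_loss(recon_audio, target_audio)
    # Semantic term: align codec features with the teacher's features
    # via cosine distance.
    semantic = 1 - F.cosine_similarity(codec_semantic, teacher_semantic, dim=-1).mean()
    return w_acoustic * acoustic + w_semantic * semantic
```

Raising `w_semantic` pushes the codec toward text alignment at the cost of reconstruction fidelity, which is exactly the trade-off the paper sets out to mitigate.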
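For the speaker similarity numbers quoted above (0.83 vs. 0.84), the standard evaluation procedure is to embed both recordings with a pretrained speaker-verification model and take the cosine similarity of the embeddings. The sketch below assumes such embeddings are already available; the `embed_speaker` placeholder in the usage note stands in for any pretrained speaker encoder (e.g., an ECAPA-TDNN) and is not part of XY-Tokenizer.

```python
# Sketch of a standard speaker-similarity evaluation: cosine similarity
# between speaker embeddings of the original and reconstructed audio.
import numpy as np

def speaker_similarity(emb_orig: np.ndarray, emb_recon: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    num = float(np.dot(emb_orig, emb_recon))
    den = float(np.linalg.norm(emb_orig) * np.linalg.norm(emb_recon))
    return num / den

# Usage (embed_speaker is a hypothetical pretrained speaker encoder):
# sim = speaker_similarity(embed_speaker(original_wav), embed_speaker(recon_wav))
# A score near 1.0 means the reconstruction preserves speaker identity.
```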

Takeaways, Limitations

Takeaways:
  • Proposes XY-Tokenizer, a new speech codec that accounts for both semantic and acoustic information.
  • Improves semantic and acoustic performance through multi-stage, multi-task learning.
  • Achieves competitive performance against existing state-of-the-art codecs in text alignment and speaker similarity.
  • Releases code and models as open source, improving accessibility.
Limitations:
  • The paper does not explicitly discuss its own limitations.
  • Generalizability is hard to assess because the experimental setup and datasets are not described in detail.
  • In-depth analysis of which factors drive XY-Tokenizer's performance gains is lacking.