In this paper, we analyze the limitations of speech codecs that serve as a bridge between speech signals and large language models, and propose XY-Tokenizer, a novel codec that jointly captures semantic and acoustic information. XY-Tokenizer mitigates the tradeoff between semantic and acoustic capability through multi-stage, multi-task learning. Experimental results show that XY-Tokenizer performs comparably to state-of-the-art codecs operating at similar bitrates on both semantic and acoustic tasks. In particular, it achieves robust text alignment, outperforming distillation-based semantic modeling methods such as SpeechTokenizer and Mimi, while maintaining a speaker similarity score of 0.83 between the reconstructed and original audio. Its reconstruction quality is comparable to that of BigCodec, a state-of-the-art audio-only codec (speaker similarity score of 0.84). The code and model are available at https://github.com/gyt1145028706/XY-Tokenizer.
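To make the multi-task idea concrete, the sketch below shows how a semantic alignment term and an acoustic reconstruction term could be combined into one training objective, along with a cosine-based speaker similarity check like the one reported above. This is a minimal, hypothetical illustration in PyTorch; the field names, loss choices, and weights (`semantic_features`, `reconstructed_mel`, `w_semantic`, `w_acoustic`) are assumptions for exposition and do not reflect the actual XY-Tokenizer implementation in the linked repository.

```python
import torch
import torch.nn.functional as F


def multi_task_codec_loss(outputs: dict, targets: dict,
                          w_semantic: float = 1.0,
                          w_acoustic: float = 1.0) -> torch.Tensor:
    """Weighted sum of a semantic term and an acoustic term (illustrative)."""
    # Semantic term: pull codec features toward features from a text-aligned
    # teacher (e.g., an ASR encoder), encouraging robust text alignment.
    semantic_loss = F.mse_loss(outputs["semantic_features"],
                               targets["teacher_features"])
    # Acoustic term: reconstruct the mel spectrogram faithfully so that
    # speaker identity and other acoustic detail are preserved.
    acoustic_loss = F.l1_loss(outputs["reconstructed_mel"],
                              targets["ground_truth_mel"])
    return w_semantic * semantic_loss + w_acoustic * acoustic_loss


def speaker_similarity(emb_original: torch.Tensor,
                       emb_reconstructed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between speaker embeddings of original and
    reconstructed audio; the paper reports ~0.83 for XY-Tokenizer."""
    return F.cosine_similarity(emb_original, emb_reconstructed, dim=-1).mean()
```

In practice, the speaker embeddings would come from a pretrained speaker verification model, and the relative weighting of the two terms would likely change across training stages, which is the intuition behind the paper's multi-stage schedule.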