FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Created by
Haebom
Author
Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, AKM Mahbubur Rahman
Outline
FuseCodec presents a novel approach to speech tokenization that integrates acoustic, semantic, and contextual representations. Whereas existing neural codecs focus on capturing low-level acoustic features, FuseCodec improves spoken language modeling by incorporating semantic and contextual cues. It does so through three core techniques: latent representation fusion, global semantic-contextual supervision, and time-aligned contextual supervision. A TTS variant, FuseCodec-TTS, demonstrates applicability to zero-shot speech synthesis, outperforming existing models on the LibriSpeech dataset.
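To make the fusion idea concrete, below is a minimal PyTorch sketch of how latent representation fusion and a global supervision term might look. All names (LatentFusion, global_supervision_loss) and dimensions are illustrative assumptions, and the cosine-similarity objective is a common stand-in, not the paper's exact loss; the authors' released code is the authoritative implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class LatentFusion(nn.Module):
    """Fuse semantic and contextual embeddings into codec encoder latents.

    All dimensions are illustrative assumptions, not values from the paper.
    """
    def __init__(self, d_acoustic=512, d_semantic=768, d_context=768):
        super().__init__()
        # Project the auxiliary streams into the acoustic latent space.
        self.sem_proj = nn.Linear(d_semantic, d_acoustic)
        self.ctx_proj = nn.Linear(d_context, d_acoustic)
        self.norm = nn.LayerNorm(d_acoustic)

    def forward(self, acoustic, semantic, contextual):
        # acoustic:   (B, T, d_acoustic) codec encoder latents
        # semantic:   (B, T, d_semantic) e.g. self-supervised speech features
        # contextual: (B, T, d_context)  e.g. language-model features
        fused = acoustic + self.sem_proj(semantic) + self.ctx_proj(contextual)
        return self.norm(fused)

def global_supervision_loss(latents, global_target):
    """Cosine-similarity objective between time-pooled latents and a global
    semantic/contextual target vector (a stand-in for the paper's loss)."""
    pooled = latents.mean(dim=1)  # (B, T, d) -> (B, d)
    return 1.0 - F.cosine_similarity(pooled, global_target, dim=-1).mean()
```

In this sketch, a time-aligned variant of the supervision would apply a similar objective per frame rather than after pooling, and the fused latents would then feed the codec's quantizer.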
Takeaways, Limitations
•
Takeaways:
◦
We improved speech tokenization performance by effectively integrating acoustic, semantic, and contextual representations.
◦
We demonstrated applicability to zero-shot speech synthesis with FuseCodec-TTS, suggesting wide-ranging potential.
◦
We showed the effectiveness of the methodology through strong performance on the LibriSpeech dataset.
◦
We improved the reproducibility of the research by publicly releasing the code and pre-trained models.
•
Limitations:
◦
The paper may lack detailed information about the specific model architecture and hyperparameter settings.
◦
Generalization to other datasets and tasks requires further study.
◦
Computational cost and training time are not discussed.