Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Created by
  • Haebom

Authors

Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, AKM Mahbubur Rahman

Outline

FuseCodec presents a novel approach to speech tokenization that integrates acoustic, semantic, and contextual representations. Whereas existing neural codecs focus on capturing low-level acoustic features, FuseCodec enriches the token stream with semantic and contextual cues, improving spoken language modeling. It does so through three core techniques: latent representation fusion, global semantic-contextual supervision, and time-aligned contextual supervision. FuseCodec-TTS demonstrates applicability to zero-shot speech synthesis, outperforming existing models on the LibriSpeech dataset.
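The summary names the three techniques but not their exact architecture, so below is a minimal PyTorch sketch of the general idea only, assuming a simple project-concatenate-project fusion of the three representation streams and cosine-similarity supervision losses. All module names (`LatentFusion`, `supervision_losses`), dimensions, and the interpolation-based time alignment are illustrative assumptions, not FuseCodec's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFusion(nn.Module):
    """Sketch of latent representation fusion: frame-level acoustic features
    are combined with semantic and contextual embeddings before quantization.
    Dimensions and the concatenate-and-project strategy are assumptions."""

    def __init__(self, d_acoustic=512, d_semantic=768, d_context=768, d_model=512):
        super().__init__()
        # Project the heterogeneous streams into a shared space (assumed).
        self.ac_proj = nn.Linear(d_acoustic, d_model)
        self.sem_proj = nn.Linear(d_semantic, d_model)
        self.ctx_proj = nn.Linear(d_context, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def _align(self, x, target_len):
        # Time-align by linear interpolation if sequence lengths differ
        # (an assumed alignment strategy, not necessarily the paper's).
        if x.size(1) == target_len:
            return x
        x = F.interpolate(x.transpose(1, 2), size=target_len,
                          mode="linear", align_corners=False)
        return x.transpose(1, 2)

    def forward(self, acoustic, semantic, context):
        # acoustic: (B, T, d_acoustic)  codec encoder output
        # semantic: (B, Ts, d_semantic) e.g. features from a speech SSL model
        # context:  (B, Tc, d_context)  e.g. embeddings from a text encoder
        T = acoustic.size(1)
        semantic = self._align(semantic, T)
        context = self._align(context, T)
        fused = torch.cat(
            [self.ac_proj(acoustic), self.sem_proj(semantic), self.ctx_proj(context)],
            dim=-1,
        )
        return self.fuse(fused)  # (B, T, d_model), fed to the quantizer


def supervision_losses(tokens_repr, global_target, frame_target):
    """Assumed cosine-based forms of the two supervision signals."""
    # Global semantic-contextual supervision: align the mean-pooled token
    # representation with an utterance-level target embedding.
    pooled = tokens_repr.mean(dim=1)
    global_loss = 1 - F.cosine_similarity(pooled, global_target, dim=-1).mean()
    # Time-aligned contextual supervision: per-frame alignment with
    # frame-level contextual targets.
    frame_loss = 1 - F.cosine_similarity(tokens_repr, frame_target, dim=-1).mean()
    return global_loss, frame_loss


if __name__ == "__main__":
    B, T, Ts = 2, 100, 50
    fusion = LatentFusion()
    fused = fusion(
        torch.randn(B, T, 512), torch.randn(B, Ts, 768), torch.randn(B, T, 768)
    )
    g_loss, f_loss = supervision_losses(
        fused, torch.randn(B, 512), torch.randn(B, T, 512)
    )
    print(fused.shape, g_loss.item(), f_loss.item())
```

In the full system, the fused representation would feed the codec's quantizer, and the two losses would presumably be weighted alongside the usual reconstruction and quantization objectives; the targets and weights here are placeholders.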

Takeaways, Limitations

Takeaways:
  • Improves speech tokenization by effectively integrating acoustic, semantic, and contextual representations.
  • Demonstrates applicability to zero-shot speech synthesis, suggesting broad downstream potential.
  • Validates the methodology with strong results on the LibriSpeech dataset.
  • Supports reproducibility by publicly releasing the code and pre-trained models.
Limitations:
  • Detailed information about the specific model architecture and hyperparameter settings may be lacking.
  • Generalization to other datasets and tasks requires further study.
  • Computational cost and training time are not reported.