Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

Created by
  • Haebom

Author

Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang

Outline

This paper highlights the importance of speech and music tokenization in multimodal large language models (MLLMs) and identifies shortcomings in existing research: semantic and acoustic tokens lack adequate definitions, and codec evaluations are biased toward specific domains or tasks (e.g., reconstruction or automatic speech recognition), making fair and comprehensive comparison difficult. The paper therefore proposes suitable definitions of semantic and acoustic tokens, together with a systematic evaluation framework that assesses codec performance along four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder-only Transformer perplexity, and downstream task performance. Experimental results support the proposed definitions and reveal correlations among reconstruction metrics, codebook ID stability, downstream task performance, and perplexity.
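To make two of the four evaluation dimensions concrete, here is a minimal sketch of a reconstruction metric (scale-invariant SNR) and a codebook ID stability measure on synthetic data. All function names, the perturbation scheme, and the data are illustrative assumptions, not the paper's actual benchmark code.

```python
import numpy as np

def si_snr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-noise ratio in dB, a common reconstruction metric."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove scale differences.
    target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def codebook_id_stability(ids_clean: np.ndarray, ids_perturbed: np.ndarray) -> float:
    """Fraction of codec token IDs unchanged when the input is slightly perturbed."""
    return float(np.mean(ids_clean == ids_perturbed))

# Toy signals standing in for a codec's input and reconstruction (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)                # 1 s of "audio" at 16 kHz
est = ref + 0.01 * rng.standard_normal(16000)   # near-perfect reconstruction

ids_a = rng.integers(0, 1024, size=500)         # token IDs from a 1024-entry codebook
ids_b = ids_a.copy()
ids_b[:50] = rng.integers(0, 1024, size=50)     # perturbation re-draws some IDs

print(f"SI-SNR: {si_snr(ref, est):.1f} dB")
print(f"codebook ID stability: {codebook_id_stability(ids_a, ids_b):.2f}")
```

A real benchmark would compute such metrics over held-out speech and music corpora and compare them against perplexity and downstream task scores, as the framework described above does.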

Takeaways, Limitations

Takeaways:
Contributes to research on speech and music tokenization in MLLMs by providing clear definitions of semantic and acoustic tokens.
Establishes a foundation for comprehensive comparison and evaluation of codec performance through a multidimensional evaluation framework.
Provides insights for codec design and optimization by identifying correlations among reconstruction metrics, codebook ID stability, downstream task performance, and perplexity.
Limitations:
Further research is needed to determine how well the proposed evaluation framework generalizes to a broader range of speech and music datasets.
The limited type and number of downstream tasks used in the evaluation may introduce bias.
A bias toward particular codecs or models cannot be ruled out.