This paper highlights the importance of speech and music tokenization in multimodal large language models (MLLMs) and identifies shortcomings in existing research: semantic and acoustic tokens lack adequate definitions, and codec evaluations are biased toward specific domains or tasks (e.g., reconstruction or automatic speech recognition), making fair and comprehensive comparisons difficult. The authors therefore propose precise definitions of semantic and acoustic tokens, together with a systematic framework that evaluates codec performance across four dimensions: acoustic reconstruction metrics, codebook index stability, decoder-only transformer perplexity, and downstream-task performance. Experimental results support the proposed definitions and reveal correlations among reconstruction metrics, codebook index stability, downstream-task performance, and perplexity.
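The paper does not spell out its exact formulation of codebook index stability, but the general idea can be illustrated with a minimal sketch: compare the discrete token IDs a codec assigns to clean audio against the IDs it assigns to a slightly perturbed version, and report the fraction of frames whose ID is unchanged. The function name and the toy ID sequences below are hypothetical and are not taken from the paper.

```python
import numpy as np

def codebook_index_stability(ids_clean, ids_perturbed):
    """Fraction of frames whose discrete token ID is unchanged
    after a small input perturbation (higher = more stable).

    Hypothetical illustration; the paper's actual metric may differ.
    """
    ids_clean = np.asarray(ids_clean)
    ids_perturbed = np.asarray(ids_perturbed)
    if ids_clean.shape != ids_perturbed.shape:
        raise ValueError("token sequences must have the same shape")
    return float(np.mean(ids_clean == ids_perturbed))

# Toy example: 6-frame token ID sequences from a hypothetical codec,
# before and after adding a small amount of noise to the waveform.
clean = [12, 40, 40, 7, 7, 99]
noisy = [12, 40, 41, 7, 7, 99]
print(codebook_index_stability(clean, noisy))  # 5 of 6 IDs match, ≈0.833
```

On this view, a codec whose token IDs flip under imperceptible input changes would score low, which is one plausible reason stability correlates with downstream-task performance and language-model perplexity.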