In this paper, we present a performance evaluation method for large language models (LLMs) that use Chain-of-Thought (CoT) reasoning. Existing CoT evaluation techniques are limited in that they either require annotated CoT data or cannot accurately assess intermediate reasoning steps. We formalize CoT reasoning in LLMs from an information-theoretic perspective and quantify the information gain of each reasoning step. We show experimentally that our method identifies failure modes of LLMs without expensive annotated data and provides more accurate insights into model performance on individual subtasks than existing outcome-based methods, on a toy arithmetic task and the GSM8K and PRM800K datasets.
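
As an illustrative sketch (the notation below is our own rendering and not necessarily the paper's exact definition), the information gain of the $t$-th reasoning step can be viewed as the reduction in uncertainty about the final answer $Y$ once step $Z_t$ is appended to the preceding steps $Z_{<t}$:

$$
\mathrm{IG}_t \;=\; H\!\left(Y \mid Z_{<t}\right) \;-\; H\!\left(Y \mid Z_{<t}, Z_t\right) \;=\; I\!\left(Y ; Z_t \mid Z_{<t}\right).
$$

Under this reading, a step that adds no new information about the answer yields $\mathrm{IG}_t \approx 0$, which is how uninformative or erroneous steps could be flagged without step-level annotations.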