Chain-of-Thought (CoT) prompting is known to improve the accuracy of large language models (LLMs) on multi-step tasks, but it remains unclear whether the generated "thought processes" reflect the model's actual internal reasoning. We present the first feature-level causal study of CoT reliability. Combining sparse autoencoders with activation patching, we extract individual semantic features while Pythia-70M and Pythia-2.8B solve GSM8K math problems under CoT and plain (noCoT) prompting. Patching a small set of CoT-derived reasoning features into the noCoT run significantly increases the log-probability of the correct answer in the 2.8B model, but has no reliable effect in the 70M model, revealing a clear scaling threshold. CoT also significantly increases activation sparsity and feature interpretability scores in the larger model, indicating more modular internal computation; for example, the model's confidence in the correct answer rises from 1.2 to 4.3. By introducing patching curves and random-feature patching baselines, we show that useful CoT information is not confined to the top-K features but is broadly distributed. Overall, our results demonstrate that CoT induces more interpretable internal structure in high-capacity LLMs, validating its role as a structured prompting method.
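To make the patching setup concrete, the sketch below illustrates the general shape of a feature-level patching experiment, assuming a PyTorch/transformers environment. The layer index, feature count (top-16), toy prompts, and the randomly initialized stand-in SAE are illustrative placeholders, not the paper's trained components or exact procedure: CoT residual-stream activations are encoded into sparse features, the top-k feature contributions are decoded, added into the noCoT forward pass, and the change in answer log-probability is measured.

```python
# Hypothetical sketch: patching decoded SAE feature contributions from a CoT
# run into a noCoT run and measuring the answer log-probability shift.
# The SAE weights here are random placeholders for a trained sparse autoencoder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"            # illustrative; the paper also uses Pythia-2.8B
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 3                                   # assumed residual-stream layer to patch
d_model = model.config.hidden_size
d_sae = 4 * d_model                             # assumed SAE dictionary size

# Placeholder SAE encoder/decoder (a trained SAE would be loaded here instead).
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
W_dec = torch.randn(d_sae, d_model) / d_sae ** 0.5

def sae_features(resid):
    """ReLU-encoded sparse feature activations for residual vectors (seq, d_model)."""
    return torch.relu(resid @ W_enc)

def _hidden(output):
    # GPT-NeoX layers may return a tuple whose first element is the hidden states.
    return output[0] if isinstance(output, tuple) else output

def capture_resid(prompt):
    """Run the model and return residual-stream activations at layer_idx, shape (seq, d_model)."""
    store = {}
    def hook(_, __, output):
        store["resid"] = _hidden(output).detach()
    h = model.gpt_neox.layers[layer_idx].register_forward_hook(hook)
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        model(ids)
    h.remove()
    return store["resid"][0]

def answer_logprob(prompt, answer, patch=None):
    """Log-probability of the answer token, optionally adding `patch` at the last position."""
    handles = []
    if patch is not None:
        def hook(_, __, output):
            hidden = _hidden(output).clone()
            hidden[0, -1] += patch              # inject decoded CoT feature contribution
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        handles.append(model.gpt_neox.layers[layer_idx].register_forward_hook(hook))
    ids = tok(prompt, return_tensors="pt").input_ids
    ans_id = tok(answer, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    for h in handles:
        h.remove()
    return torch.log_softmax(logits, dim=-1)[ans_id].item()

cot_prompt = "Q: 3 apples plus 4 apples? Let's think step by step. 3 + 4 = 7. A:"
nocot_prompt = "Q: 3 apples plus 4 apples? A:"
answer = " 7"

# Encode the CoT residual at the final position and keep only the top-k features.
cot_resid = capture_resid(cot_prompt)[-1]
feats = sae_features(cot_resid.unsqueeze(0))[0]
topk = torch.topk(feats, k=16).indices
patch_vec = feats[topk] @ W_dec[topk]           # decoded contribution of the top-k CoT features

base = answer_logprob(nocot_prompt, answer)
patched = answer_logprob(nocot_prompt, answer, patch=patch_vec)
print(f"noCoT log-prob: {base:.3f} | patched with top-16 CoT features: {patched:.3f}")
```

A random-feature baseline, as described above, would repeat the last step with k randomly chosen feature indices in place of `topk`, and a patching curve would sweep k while recording the answer log-probability.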