This paper examines why Chain-of-Thought (CoT) prompting improves the performance of Large Language Models (LLMs), viewed through a data-distribution lens. We investigate whether CoT reasoning reflects structured inductive biases learned from the training data, enabling the model to conditionally generate reasoning paths that approximate those observed during training. To this end, we design DataAlchemy, a controlled environment in which we train LLMs from scratch and systematically probe them under varied distributional conditions. We analyze CoT reasoning along three dimensions: task, length, and format. Our results reveal that CoT reasoning is a brittle phenomenon that vanishes outside the training distribution, underscoring the difficulty of achieving genuinely generalizable reasoning.