This paper introduces Multi-Scale Probabilistic Generation Theory (MSPGT), a framework developed to explain the internal mechanisms of large language models (LLMs). MSPGT models LLMs as Hierarchical Variational Information Bottleneck (H-VIB) systems and argues that standard language modeling objectives implicitly optimize multi-scale information compression. Under this view, training drives the spontaneous formation of three internal processing scales: Global, Intermediate, and Local. MSPGT formalizes these principles and derives testable predictions about where scale boundaries fall and how they depend on architecture. Experiments on the Llama and Qwen model families validate these predictions, revealing consistency across models alongside architecture-specific variability. MSPGT thus offers a predictive, information-theoretic account of how hierarchical structure arises within LLMs.