To address the growing challenge of evaluating large language models (LLMs), this paper proposes a novel evaluation framework, the Structured Transition Evaluation Method (STEM). STEM analyzes performance variations across LLMs that share an architecture but differ in parameter size to identify significant transition samples (STS), which are then used to estimate the performance of unknown models efficiently and interpretably. Using the Qwen3 model family, we build a pool of STS across six diverse benchmarks. Experimental results demonstrate that STEM reliably captures model performance trends and matches ground-truth performance rankings, highlighting STEM as a practical and scalable approach to evaluating LLMs independently of fine-tuning and architecture.
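
To make the idea concrete, the sketch below illustrates one way such transition samples could be identified and used; it is not the authors' implementation. It assumes an STS is a sample whose per-sample correctness flips exactly once, from incorrect to correct, as parameter count grows within one model family, and that an unknown model is placed at the largest size tier whose STS it mostly answers correctly. The function names, the 50% threshold, and the size labels are illustrative assumptions.

```python
from typing import Dict, List


def find_transition_samples(correct: Dict[str, List[int]],
                            sizes_ascending: List[str]) -> Dict[int, str]:
    """Map sample index -> smallest model size that first answers it correctly,
    keeping only samples with a single clean incorrect->correct transition."""
    n = len(correct[sizes_ascending[0]])
    sts: Dict[int, str] = {}
    for i in range(n):
        pattern = [correct[m][i] for m in sizes_ascending]
        # keep only monotone 0...0 1...1 patterns that contain at least one flip
        if sorted(pattern) == pattern and 0 in pattern and 1 in pattern:
            sts[i] = sizes_ascending[pattern.index(1)]
    return sts


def estimate_tier(unknown_correct: Dict[int, int], sts: Dict[int, str],
                  sizes_ascending: List[str]) -> str:
    """Place an unknown model at the largest tier whose transition samples
    it answers mostly correctly (hypothetical 50% threshold)."""
    best = sizes_ascending[0]
    for size in sizes_ascending:
        idxs = [i for i, s in sts.items() if s == size]
        if idxs and sum(unknown_correct[i] for i in idxs) / len(idxs) >= 0.5:
            best = size
    return best
```

A toy usage, with made-up per-sample correctness for three sizes of one family:

```python
acc = {"0.6B": [0, 0, 1, 0], "4B": [0, 1, 1, 0], "32B": [1, 1, 1, 1]}
sts = find_transition_samples(acc, ["0.6B", "4B", "32B"])
# -> {0: "32B", 1: "4B", 3: "32B"}  (sample 2 never transitions, so it is dropped)
```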