This paper proposes the Factual Robustness Score (FRS), a novel metric for assessing the robustness of factual knowledge in large language models (LLMs). Whereas existing evaluation methods focus primarily on performance-based metrics and on the external effects of prompt perturbations, this paper presents a principled approach to measuring factual robustness during the generation process itself, by analyzing the entropy of token distributions and their sensitivity to temperature scaling. Experiments on five LLMs and three closed-ended question-answering datasets (SQuAD, TriviaQA, and HotpotQA) show that factual robustness varies significantly with model size (an FRS of 0.76 for small models versus 0.93 for large models) and that accuracy decreases by approximately 60% as uncertainty increases. The analysis quantifies the impact of entropy and temperature scaling on factual accuracy, laying the groundwork for models with more robust knowledge retention and retrieval.
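
To make the core measurement concrete, the following minimal sketch probes how the entropy of a model's next-token distribution responds to temperature scaling. It assumes a Hugging Face causal LM; the `robustness_proxy` aggregation is an illustrative stand-in for entropy-based robustness scoring, not the paper's exact FRS definition.

```python
# Minimal sketch of entropy-under-temperature probing, assuming a Hugging Face
# causal LM. The aggregation in robustness_proxy is illustrative only; it is
# NOT the paper's exact FRS formula.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model, used here purely for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def next_token_entropy(prompt: str, temperature: float) -> float:
    """Shannon entropy (nats) of the next-token distribution at a given temperature."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1, :]        # logits at the last position
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature scaling
    return float(-(probs * (probs + 1e-12).log()).sum())

def robustness_proxy(prompt: str, temperatures=(0.5, 1.0, 1.5, 2.0)) -> float:
    """Illustrative proxy: 1 minus the mean entropy across temperatures,
    normalized by the maximum possible entropy. Values near 1 mean the
    distribution stays sharply peaked on its answer as temperature grows."""
    max_entropy = torch.log(torch.tensor(float(model.config.vocab_size))).item()
    mean_entropy = sum(next_token_entropy(prompt, t) for t in temperatures) / len(temperatures)
    return 1.0 - mean_entropy / max_entropy

print(robustness_proxy("The capital of France is"))
```

A high proxy value indicates that the answer token remains dominant even as temperature flattens the distribution, which is the intuition behind linking entropy and temperature sensitivity to factual robustness.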