Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

From Confidence to Collapse in LLM Factual Robustness

Created by
  • Haebom

Author

Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci

Outline

This paper proposes the Factual Robustness Score (FRS), a novel metric for assessing the robustness of factual knowledge in large language models (LLMs). While existing evaluation methods focus primarily on performance-based metrics and on external perturbations such as prompt changes, this paper presents a principled approach that measures factual robustness during the generation process itself by analyzing token-distribution entropy and sensitivity to temperature scaling. Experiments on five LLMs and three closed-ended question-answering datasets (SQuAD, TriviaQA, and HotpotQA) show that factual robustness varies significantly with model size (0.76 for small models vs. 0.93 for large models) and that accuracy drops by approximately 60% as uncertainty increases. This analysis demonstrates the impact of entropy and temperature scaling on factual accuracy and lays the foundation for developing models with more robust knowledge retention and retrieval.
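To make the underlying mechanism concrete, below is a minimal Python sketch: it measures the entropy of a model's next-token distribution for the answer tokens under several temperatures, and converts low, stable entropy into a high robustness score. The function names and the aggregation here are hypothetical illustrations of the idea, not the paper's exact FRS formula.

```python
import numpy as np

def token_entropy(logits: np.ndarray, temperature: float = 1.0) -> float:
    """Shannon entropy (in nats) of the softmax over next-token logits
    after temperature scaling; higher entropy means a less confident prediction."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()                 # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())

def factual_robustness_sketch(answer_token_logits, temperatures=(0.5, 1.0, 1.5, 2.0)):
    """Hypothetical robustness score in [0, 1]: the average normalized confidence,
    1 - entropy / max_entropy, of the answer tokens across several temperatures.
    Illustrative only; not the paper's exact FRS definition."""
    vocab_size = answer_token_logits[0].shape[0]
    max_entropy = np.log(vocab_size)               # entropy of the uniform distribution
    scores = [
        1.0 - token_entropy(logits, t) / max_entropy
        for t in temperatures
        for logits in answer_token_logits
    ]
    return float(np.mean(scores))

# Toy usage: one "confident" answer token (a single dominant logit)
# versus one "uncertain" answer token (near-uniform logits).
rng = np.random.default_rng(0)
confident = [np.concatenate(([10.0], rng.normal(0.0, 1.0, 999)))]
uncertain = [rng.normal(0.0, 1.0, 1000)]
print(factual_robustness_sketch(confident))        # close to 1 -> robust knowledge
print(factual_robustness_sketch(uncertain))        # much lower -> fragile knowledge
```

In this toy example, the sharply peaked distribution scores near 1 while the flat distribution scores much lower, mirroring the paper's observation that factual accuracy collapses as uncertainty grows.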

Takeaways, Limitations

Takeaways:
Presents FRS, a new metric for evaluating the factual knowledge robustness of LLMs.
Introduces an evaluation method that focuses on the generation process itself.
Reveals the correlation between model size and factual robustness.
Identifies the degradation of accuracy as uncertainty increases.
Lays a foundation for improving the knowledge retention and retrieval capabilities of future LLMs.
Limitations:
Further research is needed on the generalizability of the proposed FRS metric.
Additional experiments are needed on other types of LLMs and datasets.
Further work is needed to refine and complement the FRS metric.