Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Created by
  • Haebom

Author

Yizhan Huang, Zhe Yang, Meifang Chen, Jianping Zhang, Michael R. Lyu

Outline

This paper addresses a fundamental question about memorization of training data in large language models (LLMs): how can the difficulty of memorizing a given piece of training data be characterized? Through experiments on the OLMo family of open models, the authors propose the entropy-memorization law, which holds that data entropy is linearly correlated with memorization score. In a case study of memorizing highly randomized strings (gibberish), they further observe that such strings, despite their apparent randomness, exhibit unexpectedly low empirical entropy relative to the broader training corpus. Applying the same strategy used to discover the entropy-memorization law, they derive Dataset Inference (DI), a simple yet effective approach for distinguishing training data from test data.
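The law relates a sequence's empirical entropy to how easily it is memorized. Below is a minimal sketch of the entropy side, assuming token-level Shannon entropy over the sequence's own empirical symbol distribution, plus a plain least-squares fit for checking a linear trend; the paper's exact tokenization, memorization-score definition, and fitting procedure are not given here, so all of this is illustrative:

```python
import math
from collections import Counter

def empirical_entropy(tokens):
    """Shannon entropy (bits per symbol) of a sequence's empirical distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept, to eyeball a linear correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Character-level entropy as a stand-in for tokenizer-level entropy:
print(empirical_entropy("aaaaabbbbb"))  # 1.0 bit: only two equally likely symbols
print(empirical_entropy("abcdefghij"))  # ~3.32 bits: ten distinct symbols
```

This also hints at the gibberish observation: a random-looking string drawn from a small alphabet still has low empirical entropy compared with text over a large vocabulary.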

Takeaways, Limitations

Takeaways:
We present the importance of data entropy for understanding memorization of training data in LLMs.
We demonstrate that the entropy-memorization law makes it possible to predict the memorization difficulty of training data.
We present a new technique, Dataset Inference (DI), that provides a way to distinguish between training and test data.
Limitations:
Since the results are based on experiments with a single model family (OLMo), further research is needed to determine whether they generalize to other LLMs.
Further analysis is needed of the strength and scope of the linear correlation posited by the entropy-memorization law.
A broader evaluation of the performance and limitations of Dataset Inference (DI) is needed.
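The summary does not describe how DI works internally. One hedged reading, consistent with "adopting the same strategy," is that a candidate dataset is flagged as training data when its memorization scores sit, on average, above what the fitted entropy-memorization line predicts. The residual criterion, the zero threshold, and the negative slope below are all illustrative assumptions, not details from the paper:

```python
def dataset_inference(entropies, mem_scores, slope, intercept, threshold=0.0):
    """Hypothetical DI criterion: flag the set as likely training data when its
    memorization scores exceed, on average, what entropy alone predicts."""
    residuals = [m - (slope * e + intercept) for e, m in zip(entropies, mem_scores)]
    return sum(residuals) / len(residuals) > threshold

# Illustrative line: memorization falls as entropy rises (slope and scores are made up).
slope, intercept = -0.1, 0.8
members = dataset_inference([2.0, 2.5, 3.0], [0.70, 0.68, 0.60], slope, intercept)
non_members = dataset_inference([2.0, 2.5, 3.0], [0.50, 0.48, 0.40], slope, intercept)
print(members, non_members)  # → True False
```

The design intuition is that a model memorizes its own training data more strongly than entropy alone would explain, so member sets show positive residuals while held-out sets do not.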