Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

Created by
  • Haebom

Author

Yizhan Huang, Zhe Yang, Meifang Chen, Huang Nianchen, Jianping Zhang, and Michael R. Lyu

Outline

We study the phenomenon of training-data memorization in large language models (LLMs). Specifically, we explore how to characterize the difficulty of memorizing data, conduct experiments on OLMo models, and propose the Entropy-Memorization Law, which states that data entropy is linearly correlated with memorization score. Furthermore, through experiments on memorizing random strings (gibberish), we find that such strings have lower entropy than ordinary training data. Building on these results, we develop a simple and effective dataset inference (DI) method that distinguishes training data from test data.
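The summary above involves two per-sample quantities, a data entropy and a memorization score, related by a linear fit. The sketch below is a toy illustration under assumptions not stated in the summary: entropy is taken as the empirical Shannon entropy of a sample's tokens, and the memorization score is approximated as the token mismatch rate between a model's continuation and the true suffix. The paper's exact definitions and estimators may differ.

```python
# Minimal sketch (not the authors' code) of the entropy-memorization relationship
# on toy inputs. "Memorization score" here is a stand-in proxy metric.
from collections import Counter
import math
import numpy as np

def empirical_entropy(tokens):
    """Shannon entropy (bits/token) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def memorization_score(generated, reference):
    """Proxy score: fraction of positions where the model continuation
    diverges from the ground-truth suffix (0 = perfectly memorized)."""
    n = min(len(generated), len(reference))
    if n == 0:
        return 1.0
    mismatches = sum(g != r for g, r in zip(generated, reference))
    return mismatches / n

def fit_entropy_memorization_law(samples):
    """Least-squares linear fit: memorization_score ~ slope * entropy + intercept."""
    entropies = np.array([empirical_entropy(s["reference"]) for s in samples])
    scores = np.array([memorization_score(s["generated"], s["reference"]) for s in samples])
    slope, intercept = np.polyfit(entropies, scores, deg=1)
    return slope, intercept

# Toy usage with hypothetical character-tokenized samples.
samples = [
    {"reference": list("the cat sat on the mat"), "generated": list("the cat sat on the mat")},
    {"reference": list("q7x!kz0p q7x!kz0p abcd"), "generated": list("q7x!kz0p q7z!kz0p abcd")},
]
print(fit_entropy_memorization_law(samples))
```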

Takeaways, Limitations

Takeaways:
  • We identified a new linear correlation between data entropy and LLM memorization, suggesting that memorization difficulty can be predicted from entropy.
  • We developed a Dataset Inference (DI) technique that distinguishes training data from test data using data entropy (see the sketch after this list).
  • Experiments on memorizing random strings show that data complexity does not necessarily equate to memorization difficulty.
Limitations:
  • Further research is needed to determine whether the results on OLMo models generalize to other LLMs.
  • Further evaluation is needed to determine how effective the DI method derived from the entropy-memorization law is on real-world datasets.
  • The relationship between entropy and memorization for different types of data requires further study.
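As a rough illustration of how a fitted entropy-memorization law could support dataset inference, the hypothetical sketch below fits the linear relation on data known to be in training and classifies a candidate set by whether its memorization scores fall at or below what the law predicts for its entropy. The decision rule, threshold, and toy inputs are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical dataset-inference (DI) sketch: compare a candidate set's
# memorization scores against a linear law fitted on known training data.
# Inputs are precomputed toy (entropy, score) arrays; score 0 = perfectly memorized.
import numpy as np

def fit_law(entropy, score):
    """Return (slope, intercept) of score ~ slope * entropy + intercept."""
    return np.polyfit(entropy, score, deg=1)

def dataset_inference(train_entropy, train_score, cand_entropy, cand_score, margin=0.0):
    """Classify the candidate set by its mean residual against the fitted law."""
    slope, intercept = fit_law(train_entropy, train_score)
    predicted = slope * cand_entropy + intercept
    mean_residual = float(np.mean(cand_score - predicted))
    # Negative residual: memorized better than entropy alone predicts -> likely training data.
    return ("train", mean_residual) if mean_residual <= margin else ("test", mean_residual)

# Toy usage with made-up (entropy, score) pairs.
train_H = np.array([2.1, 3.0, 3.8, 4.5])
train_S = np.array([0.10, 0.22, 0.35, 0.47])
cand_H = np.array([2.5, 4.0])
cand_S = np.array([0.12, 0.36])
print(dataset_inference(train_H, train_S, cand_H, cand_S))
```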