Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Beyond Frequency: The Role of Redundancy in Large Language Model Memorization

Created by
  • Haebom

Authors

Jie Zhang, Qinghua Zhao, Chi-ho Lin, Zhongfeng Kang, Lei Li

Outline

This paper addresses the privacy and fairness risks posed by memorization in large language models (LLMs). Whereas previous studies have linked memorization to token frequency and repetition patterns, this study uncovers a distinct response pattern: increasing frequency has only a minimal effect (e.g., 0.09) on memorized samples but a substantial effect (e.g., 0.25) on non-memorized samples. Using a counterfactual analysis that quantifies perturbation strength by varying sample prefixes and token positions, the authors show that redundancy correlates with memorization patterns. Approximately 79% of memorized samples have low redundancy, and these low-redundancy samples are twice as vulnerable to perturbation as high-redundancy samples. Perturbations reduce the memorization score of memorized samples by 0.6 but that of non-memorized samples by only 0.01, indicating that more redundant content is both more memorable and more vulnerable. These findings suggest that a redundancy-based approach to data preprocessing can mitigate privacy risks and help ensure fairness.
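The paper's exact metrics are not reproduced in this summary, but the gist of the counterfactual analysis can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions: redundancy is approximated by a compression-ratio proxy, the perturbation replaces one prefix token at a time, and `toy_score` is a hypothetical stand-in for a model-based memorization score (e.g., exact-continuation match). None of these choices are taken from the paper itself.

```python
import zlib
from typing import Callable, List

def redundancy(text: str) -> float:
    """Compression-ratio proxy for redundancy (an assumed metric, not
    necessarily the paper's): higher means more repeated structure."""
    raw = text.encode("utf-8")
    return 1.0 - len(zlib.compress(raw)) / max(len(raw), 1)

def perturbation_sensitivity(tokens: List[str], prefix_len: int,
                             score: Callable[[List[str]], float]) -> float:
    """Counterfactual analysis: replace one prefix token at a time and
    average the drop in a memorization score."""
    base = score(tokens)
    drops = []
    for pos in range(prefix_len):
        perturbed = list(tokens)
        perturbed[pos] = "<mask>"  # counterfactual edit at this token position
        drops.append(base - score(perturbed))
    return sum(drops) / len(drops)

if __name__ == "__main__":
    sample = "the cat sat on the mat the cat sat on the mat".split()
    prefix_len, reference = 6, sample[6:]

    def toy_score(tokens: List[str]) -> float:
        # Hypothetical stand-in for querying an LLM: pretend the model
        # reproduces the continuation verbatim only for the exact prefix.
        cont = (reference if tokens[:prefix_len] == sample[:prefix_len]
                else ["<unk>"] * len(reference))
        return sum(a == b for a, b in zip(cont, reference)) / len(reference)

    print(f"redundancy  = {redundancy(' '.join(sample)):.3f}")
    print(f"sensitivity = {perturbation_sensitivity(sample, prefix_len, toy_score):.3f}")
```

In a real setting, `toy_score` would query the LLM for the continuation of the (possibly perturbed) prefix and compare it against the training sample.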

Takeaways, Limitations

Takeaways:
Discovery of a new response pattern in LLM memorization (the effect of increasing frequency differs between memorized and non-memorized samples)
Identification of the correlation between redundancy and memorization patterns (low-redundancy samples are more vulnerable)
A redundancy-based approach to data preprocessing that can improve privacy and fairness (see the sketch below this list)
Limitations:
Further research is needed to determine whether the results of this study can be generalized to all LLMs.
The influence of factors other than redundancy on memorization needs to be analyzed.
The practical effectiveness of the proposed redundancy-based data preprocessing method needs to be verified.
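To make the redundancy-based preprocessing idea concrete, here is a hedged sketch of how low-redundancy samples could be flagged for extra review (e.g., deduplication or PII scrubbing) before training. The `compression_redundancy` proxy and the `threshold` value are assumptions for illustration, not the paper's method.

```python
import zlib
from typing import Iterable, List, Tuple

def compression_redundancy(text: str) -> float:
    """Compression-ratio proxy for redundancy (an assumed metric)."""
    raw = text.encode("utf-8")
    return 1.0 - len(zlib.compress(raw)) / max(len(raw), 1)

def flag_low_redundancy(corpus: Iterable[str],
                        threshold: float = 0.3) -> Tuple[List[str], List[str]]:
    """Split a corpus into low-redundancy samples (flagged for extra
    deduplication or scrubbing before training) and the rest.
    The 0.3 threshold is illustrative only."""
    flagged, kept = [], []
    for doc in corpus:
        (flagged if compression_redundancy(doc) < threshold else kept).append(doc)
    return flagged, kept

if __name__ == "__main__":
    corpus = [
        "unique customer note: invoice 58213, contact j.doe at example dot com",
        "buy one get one free buy one get one free buy one get one free",
    ]
    flagged, kept = flag_low_redundancy(corpus)
    print("flagged for review:", flagged)
    print("kept:", kept)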