Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Rethinking Exact Unlearning under Exposure: Extracting Forgotten Data under Exact Unlearning in Large Language Model

Created by
  • Haebom

Authors

Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu

Outline

This paper highlights the limitations of exact unlearning as a remedy for the leakage of sensitive training data from large language models (LLMs). Specifically, in a realistic deployment setting where logit APIs for both the pre- and post-unlearning models are exposed, the authors propose a novel data extraction attack that leverages signals from the pre-unlearning model to recover patterns of the deleted data from the post-unlearning model. By combining model guidance with a token filtering strategy, the attack significantly improves extraction success rates, and its real-world risk is demonstrated on a medical diagnosis dataset. The study suggests that unlearning may in fact increase the risk of privacy leakage and argues that unlearning techniques should be evaluated under a broader threat model, including adversaries with access to the pre-unlearning model.
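
The summary above does not reproduce the paper's exact procedure, but the core idea of combining model guidance with token filtering can be sketched roughly as follows. This is a minimal, hypothetical illustration only: it assumes HuggingFace-style causal LMs whose logits are queryable, and the function names, the `alpha` guidance strength, and the `top_k` filter size are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def guided_extraction_step(pre_logits, post_logits, alpha=1.0, top_k=50):
    """One decoding step of a guided extraction attack (illustrative only).

    pre_logits / post_logits: 1-D logit tensors over the vocabulary, returned
    by the pre- and post-unlearning logit APIs for the same prefix.
    alpha (guidance strength) and top_k (filter size) are assumed values.
    """
    # Token filtering: restrict candidates to the pre-unlearning model's
    # top-k tokens, since forgotten content was high-probability there.
    _, topk_idx = pre_logits.topk(top_k)

    # Model guidance: favor tokens whose likelihood dropped after unlearning,
    # a contrastive-style signal pointing toward the deleted data.
    guidance = pre_logits[topk_idx] - post_logits[topk_idx]
    scores = pre_logits[topk_idx] + alpha * guidance

    # Greedy pick over the filtered, guided scores.
    return topk_idx[F.softmax(scores, dim=-1).argmax()]


def extract_sequence(pre_model, post_model, prompt_ids, max_new_tokens=64):
    """Autoregressively decode a candidate for the forgotten continuation."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            pre = pre_model(ids.unsqueeze(0)).logits[0, -1]
            post = post_model(ids.unsqueeze(0)).logits[0, -1]
        next_id = guided_extraction_step(pre, post)
        ids = torch.cat([ids, next_id.view(1)])
    return ids
```

The paper's actual attack may differ in its details; the point illustrated here is that the gap between pre- and post-unlearning logits itself carries a signal about what was forgotten.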

Takeaways, Limitations

Takeaways:
  • Exact unlearning methods, although regarded as the "gold standard" for privacy, can be vulnerable in real-world deployments.
  • Data extraction attacks that exploit information from the pre-unlearning model are feasible, allowing a significant portion of the deleted data to be recovered even after unlearning.
  • The attack's effectiveness is also verified on real-world data, such as a medical diagnosis dataset, underscoring the practical risks of unlearning.
  • When assessing the security of unlearning techniques, broader threat models, including adversarial access to the pre-unlearning model, must be considered.
Limitations:
  • The study focuses on a specific setting in which the logit APIs of both the pre- and post-unlearning models are exposed.
  • Although the attack's success rate is substantially improved, complete recovery of the deleted data is not guaranteed.
  • Further research is needed on the generalizability of the attack and its applicability to other unlearning techniques.
  • The study examines specific datasets and a specific attack technique, limiting generalized conclusions about other datasets and attack methods.