Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Created by
  • Haebom

Author

Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz

Outline

This paper demonstrates that certain machine unlearning methods are vulnerable to simple prompt attacks. We systematically evaluate eight unlearning techniques across three model families, using output-based, logit-based, and probe analyses to determine how much supposedly unlearned knowledge can still be retrieved. While methods such as RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., adding Hindi filler text to the original prompt recovers 57.3% accuracy). Logit analysis further shows that unlearned models are generally not hiding knowledge merely by changing the answer format, since output accuracy correlates strongly with logit accuracy. These results challenge conventional assumptions about the effectiveness of unlearning and highlight the need for a reliable evaluation framework that can distinguish genuine knowledge removal from superficial output suppression. To facilitate further research, we release an evaluation framework for systematically testing prompting techniques that attempt to recover unlearned knowledge.
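As a rough illustration of the filler-text attack described above, the sketch below prepends filler to a multiple-choice question and picks the answer choice with the highest next-token log-probability (a simple logit-based readout); comparing accuracy with and without the filler indicates whether knowledge was removed or merely suppressed. The model name, filler string, and scoring scheme are illustrative assumptions, not the paper's released framework.

```python
# Minimal sketch of the filler-text prompt attack, assuming a Hugging Face
# causal LM. MODEL_NAME, FILLER, and the choice-scoring scheme are
# illustrative placeholders, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/unlearned-model"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Hindi filler text (roughly "This is only filler text.") prepended to the prompt.
FILLER = "यह केवल भराव पाठ है। "

def pick_choice(question: str, choices: list[str], use_filler: bool) -> int:
    """Return the index of the answer choice whose label has the highest
    next-token log-probability after the prompt."""
    prompt = (FILLER if use_filler else "") + question + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    logprobs = torch.log_softmax(logits, dim=-1)
    # Score each choice label (e.g., "A", "B", ...) by its first token id.
    ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]
    scores = [logprobs[i].item() for i in ids]
    return scores.index(max(scores))

# Comparing benchmark accuracy with use_filler=True vs. False shows whether
# "unlearned" knowledge is recoverable by this trivial perturbation.
```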

Takeaways, Limitations

Takeaways: By revealing that some unlearning techniques are vulnerable to simple prompt attacks, we show the need to reexamine claims about unlearning effectiveness. We clearly distinguish robust unlearning techniques, such as RMU and TAR, from vulnerable ones, such as ELM. The released evaluation framework can support future research.
Limitations: The evaluation covers a limited set of model families and unlearning techniques. The analysis may not span the full range of possible prompt attacks. Further research is needed to establish generalizability to real-world applications.