Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models

Created by
  • Haebom

Author

Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat

Outline

This paper evaluates the effectiveness of soft token attacks (STAs) for auditing machine unlearning in large language models (LLMs). While prior work has shown that STAs can successfully extract supposedly unlearned information, this study demonstrates that, in a proper auditing setup, STAs can extract essentially any information from an LLM, regardless of whether it was targeted by the unlearning algorithm or even present in the original training data. Using the Who Is Harry Potter and TOFU benchmarks, the authors show that as few as 1-10 soft tokens are enough to make a model emit arbitrary strings of more than 400 characters, so a successful extraction says little about whether unlearning actually failed. The paper therefore argues that STAs must be deployed with care if they are to serve as unlearning audits.
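To make the mechanism concrete, below is a minimal, hypothetical sketch of a soft token attack of the kind the paper studies: a handful of continuous embeddings prepended to the input are optimized by gradient descent until the frozen model assigns high likelihood to a chosen target string. The model name, number of soft tokens, learning rate, and step count are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a soft token attack (STA). We freeze the LLM and
# optimize a small number of continuous "soft token" embeddings prepended to
# the input so that a chosen target string becomes maximally likely.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():          # keep the model itself fixed
    p.requires_grad_(False)

# The string the auditor tries to force out of the model (hypothetical).
target = "An arbitrary string the attacker wants the model to reproduce."
target_ids = tok(target, return_tensors="pt").input_ids          # (1, T)
target_embeds = model.get_input_embeddings()(target_ids).detach()

num_soft_tokens = 5                   # the paper reports 1-10 tokens suffice
embed_dim = model.get_input_embeddings().weight.shape[1]
soft_tokens = torch.randn(1, num_soft_tokens, embed_dim, requires_grad=True)
optimizer = torch.optim.Adam([soft_tokens], lr=1e-2)

for step in range(300):
    # Prepend the trainable soft embeddings to the target's embeddings and
    # compute the language-modeling loss on the target positions only
    # (-100 masks the soft-token positions out of the loss).
    inputs_embeds = torch.cat([soft_tokens, target_embeds], dim=1)
    labels = torch.cat(
        [torch.full((1, num_soft_tokens), -100, dtype=torch.long), target_ids],
        dim=1,
    )
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After optimization, feeding the learned soft tokens back in via
# inputs_embeds tends to elicit the target string, whether or not that
# string was ever "unlearned" -- which is why STAs are a weak audit signal.
```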

Takeaways, Limitations

Takeaways: Clearly demonstrates the limitations and risks of using STAs to audit LLM unlearning, underscoring the need for more reliable auditing methods and safer, more effective unlearning techniques. By exposing the weaknesses of STA-based auditing, the work points to research directions for data security and privacy protection in LLMs.
Limitations: The results are limited to specific benchmarks and audit settings; further research is needed on a wider range of LLM architectures, unlearning algorithms, and real-world datasets. Attack techniques other than STAs are not analyzed.