Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Created by
  • Haebom

Authors

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding

StableUN: Robust Unlearning via Neighborhood-Aware Optimization

Outline

This paper identifies a security vulnerability in current LLM unlearning methods: they are susceptible to "relearning" attacks. The authors show that existing methods drive model parameters into sharp minima of the loss landscape, leaving them in unstable regions where supposedly forgotten knowledge can be recovered with only a small amount of fine-tuning data. To address this, they propose StableUN, a bi-level, feedback-guided optimization framework that seeks more stable parameter regions through neighborhood-aware optimization. StableUN integrates forgetting feedback, which probes the parameter neighborhood with adversarial perturbations, and remembering feedback, which preserves model utility, and aligns the two objectives via gradient projection. On the WMDP and MUSE benchmarks, StableUN shows stronger robustness to relearning and jailbreaking attacks while maintaining competitive utility.
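To make the mechanics concrete, below is a minimal PyTorch sketch of what one such feedback-guided update could look like. It is an illustration assembled from the summary above, not the authors' implementation: `forget_loss`, `retain_loss`, `epsilon`, and the single-perturbation neighborhood probe are all assumptions.

```python
import torch

def stable_unlearn_step(model, forget_batch, retain_batch,
                        forget_loss, retain_loss, optimizer,
                        epsilon=1e-2):
    """One hypothetical update combining forgetting and remembering feedback."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Forgetting feedback: perturb parameters toward the worst-case
    # neighbor (sharpness-aware style) before taking the forget gradient,
    # so flat regions of the forget loss are favored over sharp minima.
    loss_f = forget_loss(model, forget_batch)
    grads = torch.autograd.grad(loss_f, params)
    scale = epsilon / (torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12)
    with torch.no_grad():
        perturb = [scale * g for g in grads]
        for p, e in zip(params, perturb):
            p.add_(e)

    # Forget gradient evaluated at the perturbed (neighborhood) point.
    g_forget = torch.autograd.grad(forget_loss(model, forget_batch), params)

    with torch.no_grad():  # restore the original parameters
        for p, e in zip(params, perturb):
            p.sub_(e)

    # Remembering feedback: utility-preserving gradient at the
    # unperturbed parameters.
    g_retain = torch.autograd.grad(retain_loss(model, retain_batch), params)

    # Gradient projection: when the two objectives conflict (negative
    # inner product), strip the conflicting component of the forgetting
    # gradient along the remembering gradient.
    dot = sum((gf * gr).sum() for gf, gr in zip(g_forget, g_retain))
    if dot < 0:
        rnorm2 = sum(gr.pow(2).sum() for gr in g_retain) + 1e-12
        g_forget = [gf - (dot / rnorm2) * gr
                    for gf, gr in zip(g_forget, g_retain)]

    # Apply the aligned forgetting gradient plus the remembering gradient.
    optimizer.zero_grad()
    for p, gf, gr in zip(params, g_forget, g_retain):
        p.grad = gf + gr
    optimizer.step()
```

The projection step mirrors standard gradient-surgery techniques such as PCGrad: components of the forgetting gradient that oppose the remembering gradient are removed, so unlearning proceeds without directly degrading retained utility.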

Takeaways, Limitations

Takeaways:
  • Presents a new methodology that addresses a core security vulnerability of LLM unlearning.
  • Provides strong defense against relearning attacks.
  • Performs unlearning while effectively preserving model utility.
  • Explores a more stable parameter space through neighborhood-aware optimization.
Limitations:
  • Validation is needed on datasets and models beyond the WMDP and MUSE benchmarks.
  • Computational complexity and training time may increase.
  • Performance may be sensitive to the choice and tuning of the adversarial perturbation settings.