Daily Arxiv

This page collects papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Created by
  • Haebom

Author

Wenhan Wu, Zheyuan Liu, Chongyang Gao, Ren Wang, Kaize Ding

StableUN: A Framework for Stable LLM Unlearning

Outline

This paper identifies a security vulnerability in large language model (LLM) unlearning techniques and proposes StableUN, a novel framework to address it. Existing unlearning methods appear to remove sensitive or harmful information, but they remain vulnerable to relearning attacks, a weakness that stems from their tendency to place model parameters in sharp minima of the loss landscape. StableUN instead uses a bidirectional, feedback-guided optimization framework that exploits neighborhood information: forgetting feedback explores the parameter neighborhood via adversarial perturbation, remembering feedback preserves model utility, and the two objectives are aligned through gradient projection. Experiments on the WMDP and MUSE benchmarks show that StableUN is more robust to relearning and jailbreaking attacks while maintaining competitive utility.
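For a concrete picture of how such a scheme could be wired together, here is a minimal PyTorch sketch of one update step combining the two feedback signals. It is an illustration of the idea described above, not the authors' implementation: the perturbation radius `rho`, the loss functions, and the projection rule are assumptions made for this example.

```python
import torch

def stableun_style_step(model, forget_batch, retain_batch, optimizer,
                        forget_loss_fn, retain_loss_fn, rho=0.05):
    """Illustrative update combining forgetting and remembering feedback.

    A sketch under assumed details (rho, loss functions, projection rule),
    not the paper's actual algorithm.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Forgetting feedback: probe the parameter neighborhood with an
    # adversarial (sharpness-aware) perturbation before computing the
    # forget-set gradient, so the update favors flat regions.
    forget_loss = forget_loss_fn(model, forget_batch)
    grads = torch.autograd.grad(forget_loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)                      # move to a worst-case neighbor

    perturbed_forget_loss = forget_loss_fn(model, forget_batch)
    g_forget = torch.autograd.grad(perturbed_forget_loss, params)

    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                      # restore original parameters

    # Remembering feedback: retain-set gradient that preserves utility.
    retain_loss = retain_loss_fn(model, retain_batch)
    g_retain = torch.autograd.grad(retain_loss, params)

    # Gradient projection: if the forgetting direction conflicts with the
    # remembering direction, drop its component along the retain gradient.
    dot = sum((gf * gr).sum() for gf, gr in zip(g_forget, g_retain))
    if dot < 0:
        retain_sq = sum((gr ** 2).sum() for gr in g_retain) + 1e-12
        g_forget = [gf - (dot / retain_sq) * gr
                    for gf, gr in zip(g_forget, g_retain)]

    # Apply the combined update direction.
    optimizer.zero_grad()
    for p, gf, gr in zip(params, g_forget, g_retain):
        p.grad = gf + gr
    optimizer.step()
```

The projection step shown here is one common way to reconcile conflicting objectives; the paper's exact alignment rule and weighting between the two feedback terms may differ.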

Takeaways, Limitations

Takeaways:
• Clearly identifies a vulnerability in existing LLM unlearning techniques and underscores the need for robust unlearning.
• Presents StableUN, a novel unlearning framework that seeks stable regions of the model parameter space.
• Demonstrates strong defense against relearning and jailbreaking attacks.
• Improves unlearning performance while maintaining model utility.
Limitations:
• Generalization to datasets and models beyond the WMDP and MUSE benchmarks still needs to be verified.
• The method may increase computational cost and training time.
• Further research is needed on how to best balance the forgetting and remembering feedback objectives.