Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning

Created by
  • Haebom

Author

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu

Outline

This paper highlights that safety-aligned large language models (LLMs) are vulnerable to harmful fine-tuning attacks: a small amount of harmful data mixed into the fine-tuning dataset can break the LLM's safety alignment. The authors show that existing defenses lose effectiveness under certain training hyperparameters in the fine-tuning stage (e.g., a high learning rate or a large number of training epochs). They therefore propose Antidote, a post-fine-tuning solution that is agnostic to the hyperparameters used during fine-tuning. Antidote is based on the principle that removing harmful parameters can recover a compromised model from harmful behavior: it applies a one-shot pruning step after fine-tuning that removes the weights responsible for generating harmful content. Experiments show that Antidote reduces harmful scores while maintaining accuracy on downstream tasks. The code is available on GitHub.
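To make the post-fine-tuning pruning idea concrete, here is a minimal sketch in PyTorch. It assumes a first-order saliency score (|weight x gradient| on a small batch of harmful data) and a fixed pruning ratio; the paper's actual scoring rule, pruning granularity, and ratio may differ, and the toy model, `harmful_batch`, and `prune_ratio` below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: one-shot removal of weights most implicated in harmful behavior,
# applied after fine-tuning. Assumes gradient-based saliency as the score.
import torch
import torch.nn as nn

def prune_harmful_weights(model: nn.Module, harmful_batch, loss_fn,
                          prune_ratio: float = 0.01):
    """Zero out the weights with the highest saliency on harmful data (one-shot)."""
    model.zero_grad()
    inputs, targets = harmful_batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()  # gradients w.r.t. the harmful objective

    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Importance score: |weight * gradient| (a first-order approximation;
            # the real method may use a different criterion).
            score = (p * p.grad).abs()
            k = max(1, int(prune_ratio * score.numel()))
            threshold = torch.topk(score.flatten(), k).values.min()
            keep_mask = score < threshold
            p.mul_(keep_mask)  # remove (zero) the top-scoring "harmful" weights

# Toy stand-in for a fine-tuned LLM; in practice this would be the attacked model
# and harmful_batch would contain tokenized harmful prompts and targets.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
harmful_batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
prune_harmful_weights(model, harmful_batch, nn.CrossEntropyLoss())
```

Because the pruning happens once, after fine-tuning is finished, it does not depend on the learning rate or epoch count an attacker chooses during fine-tuning, which is the property the paper emphasizes.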

Takeaways, Limitations

Takeaways: The paper presents a novel post-fine-tuning defense (Antidote) that protects LLMs against harmful fine-tuning attacks regardless of the hyperparameters used in the fine-tuning stage. This simple one-shot pruning approach reduces harmful scores while maintaining downstream task accuracy.
Limitations: Further research is needed on the general effectiveness of Antidote and its robustness against diverse types of harmful data; it may remain vulnerable to certain attacks or hyperparameter combinations, and pruning may degrade downstream accuracy in some cases.