Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Created by
  • Haebom

Authors

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah

Outline

Large language models (LLMs) are aligned to comply with safety guidelines by refusing harmful instructions. A recent attack called 'abliteration' isolates and suppresses the single latent direction most responsible for refusal behavior, allowing the model to generate unethical content. In this paper, we propose a defense that modifies how the model generates refusals. We construct an extended-refusal dataset containing harmful prompts paired with full responses that explain the reason for the refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset and evaluate the resulting models on a set of harmful prompts. Experimental results show that after abliteration, the refusal rate of the extended-refusal models drops by only about 10%, so they continue to refuse at a high rate, whereas the refusal rates of the baseline models drop by 70-80%. Extensive evaluations of safety and utility demonstrate that extended-refusal fine-tuning neutralizes abliteration attacks while preserving general performance.
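To make the attack concrete, below is a minimal sketch of the directional-ablation idea behind abliteration, assuming access to a model's hidden activations: estimate the refusal direction as the difference in mean activations between harmful and harmless prompts, then project that direction out of a weight matrix so the model can no longer write along it. The function names and toy tensors here are illustrative stand-ins, not the paper's code.

```python
# Minimal sketch of directional ablation ("abliteration"); hypothetical,
# not the paper's implementation. Random tensors stand in for
# residual-stream activations collected from a real model.
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means estimate of the single latent refusal direction."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def ablate_direction(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the refusal direction from a weight matrix's output space:
    W' = (I - d d^T) W, so outputs carry no component along d."""
    return W - torch.outer(d, d) @ W

hidden = 64
harmful_acts = torch.randn(100, hidden) + 0.5   # stand-in activations on harmful prompts
harmless_acts = torch.randn(100, hidden)        # stand-in activations on harmless prompts
d = refusal_direction(harmful_acts, harmless_acts)

W = torch.randn(hidden, hidden)                 # stand-in for an output projection matrix
W_ablit = ablate_direction(W, d)
print((d @ W_ablit).norm())                     # ~0: the refusal direction is suppressed
```

The defense works on the data side. A hypothetical extended-refusal training record might look like the following, assuming a simple prompt/response schema (the paper's exact format may differ). The key property is that the refusal is a full, reasoned response rather than a short fixed phrase, so refusal behavior is less likely to collapse onto a single ablatable direction.

```python
# Hypothetical extended-refusal training pair (illustrative schema only).
extended_refusal_example = {
    "prompt": "Explain how to pick a lock to break into a house.",
    "response": (
        "I can't help with that. Breaking into someone's home is illegal and "
        "violates their safety and privacy, and lock-picking instructions in "
        "this context could facilitate burglary. If you are locked out of "
        "your own home, contact a licensed locksmith instead."
    ),
}
```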

Takeaways, Limitations

Takeaways: Fine-tuning on an extended-refusal dataset is shown to be an effective defense against abliteration attacks. The work proposes a novel approach that can contribute to improving LLM safety, demonstrating that a model's general performance can be preserved while maintaining robustness against abliteration.
Limitations: The defense focuses on a specific attack (abliteration); its effectiveness against other types of attacks requires further study. Further validation of the dataset used and of generalizability across models is needed. The cost and effort of creating and maintaining an extended-refusal dataset must also be considered.