Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

Created by
  • Haebom

Authors

Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong

Outline

This paper addresses the safety of large language models (LLMs), in particular their vulnerability to answering socially harmful questions. The authors experimentally demonstrate that aligned models can be compromised by subsequent fine-tuning despite prior safety-alignment efforts. They trace this vulnerability to the sensitivity of a safety-related low-rank subspace in the LLM parameters to fine-tuning, and based on this insight propose Low-Rank Extrapolation (LoX), a novel training-free method that improves safety robustness by extrapolating the safety subspace of an aligned LLM. Experimental results show that LoX significantly improves robustness against fine-tuning attacks while preserving the model's adaptability to new tasks; for example, LoX reduces the attack success rate (ASR) of benign and malicious fine-tuning attacks by 11% to 54%. By examining the ASR landscape over the parameters, the authors attribute LoX's success to extrapolation moving the LLM parameters into a flatter region that is less sensitive to perturbations. The code is available at github.com/VITA-Group/LoX.
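The core operation described above is a training-free edit of the weights: take the alignment update (aligned weights minus pre-aligned weights), keep its top singular directions as the safety subspace, and extrapolate along it. Below is a minimal sketch of that idea for a single weight matrix; the function name, rank k, and extrapolation factor alpha are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def lox_extrapolate(w_aligned: torch.Tensor,
                    w_base: torch.Tensor,
                    k: int = 8,
                    alpha: float = 0.5) -> torch.Tensor:
    """Hedged sketch of Low-Rank Extrapolation (LoX) for one weight matrix.

    The alignment update w_aligned - w_base is assumed to carry the safety
    signal; its top-k singular directions approximate the safety subspace,
    which is then amplified (extrapolated) by a factor alpha.
    """
    delta = w_aligned - w_base                       # update introduced by alignment
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    # Rank-k approximation of the alignment update (the safety subspace).
    delta_k = u[:, :k] @ torch.diag(s[:k]) @ vh[:k, :]
    # Extrapolate: push the weights further along the safety subspace.
    return w_aligned + alpha * delta_k
```

In practice such an edit would be applied per weight matrix (e.g., attention and MLP projections) before releasing the model, which is what makes the approach training-free.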

Takeaways, Limitations

Takeaways:
Introduces LoX, a novel training-free method for improving LLM safety.
Experimentally demonstrates that LoX significantly improves robustness against benign and malicious fine-tuning attacks.
Identifies the sensitivity of a safety-related low-rank subspace in the parameters as the root cause of LLM safety vulnerability (a diagnostic sketch follows the Limitations list below).
Presents a new direction for research on improving LLM safety.
Limitations:
LoX's performance is demonstrated on specific datasets and models; further research is needed to establish generalizability.
The effectiveness of LoX against various types of attacks and fine-tuning methods needs to be verified.
Further analysis of the computational cost and applicability of LoX is needed.
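
As noted in the Takeaways, the root-cause analysis concerns how much fine-tuning perturbs the safety subspace. A hedged diagnostic sketch of that idea follows (the function name and rank k are illustrative assumptions, not from the paper): it measures what fraction of a fine-tuning update's energy lands inside the top-k subspace of the alignment update.

```python
import torch

def safety_subspace_overlap(w_base: torch.Tensor,
                            w_aligned: torch.Tensor,
                            w_finetuned: torch.Tensor,
                            k: int = 8) -> float:
    """Hypothetical diagnostic: fraction of the fine-tuning update's
    Frobenius energy that falls inside the top-k safety subspace of the
    alignment update w_aligned - w_base. A value near 1 would indicate
    that fine-tuning is rewriting exactly the safety-relevant directions.
    """
    u, _, vh = torch.linalg.svd(w_aligned - w_base, full_matrices=False)
    uk, vk = u[:, :k], vh[:k, :]                  # top-k singular directions
    ft_delta = w_finetuned - w_aligned            # update introduced by fine-tuning
    # Project the fine-tuning update onto the safety subspace.
    inside = uk @ (uk.T @ ft_delta @ vk.T) @ vk
    return (inside.norm() / ft_delta.norm()).item()
```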