Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Created by
  • Haebom

Author

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

Outline

To address the safety risks introduced by fine-tuning large language models (LLMs), this paper proposes Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection. Unlike existing safety defense strategies that focus solely on the safety layer, FGSN considers the interactions between safety layers and individual neurons, enabling a more precise and efficient safety mechanism. FGSN projects the parameters of safety neurons toward the safety direction, improving model safety while better aligning the model with human preferences. Extensive experiments on several fine-tuned LLMs show that the method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving model usability. Furthermore, by introducing a task-specific, multidimensional heterogeneous safety neuron cluster optimization mechanism, the method achieves continual defense and generalization against unforeseen safety problems.
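The summary above does not specify how the paper locates safety neurons or defines the safety direction, so the sketch below only illustrates the general idea of a training-free, neuron-level projection. Everything here is a hypothetical assumption for illustration: the function name `project_safety_neurons`, the `safety_scores` used to pick neurons, and the choice of the aligned (pre-fine-tuning) weights as the safety direction are not the authors' implementation.

```python
import torch

def project_safety_neurons(w_aligned: torch.Tensor,
                           w_finetuned: torch.Tensor,
                           safety_scores: torch.Tensor,
                           top_k: int = 16) -> torch.Tensor:
    """Training-free projection sketch: for the top-k hypothetical
    'safety neurons' (rows with the highest safety scores), keep only
    the component of the fine-tuning update that lies along the
    neuron's assumed safety direction; all other neurons are untouched."""
    w_new = w_finetuned.clone()
    # Assumed safety direction per neuron: its aligned weight vector, normalized.
    safety_dir = torch.nn.functional.normalize(w_aligned, dim=1)
    # Fine-tuning update for every neuron (row of the weight matrix).
    delta = w_finetuned - w_aligned
    # Select the neurons assumed to encode safety behaviour.
    idx = torch.topk(safety_scores, k=top_k).indices
    # Project each selected neuron's update onto its safety direction,
    # discarding the orthogonal part of the update.
    coeff = (delta[idx] * safety_dir[idx]).sum(dim=1, keepdim=True)
    w_new[idx] = w_aligned[idx] + coeff * safety_dir[idx]
    return w_new

# Toy usage: a 64x128 weight matrix with random "fine-tuning" noise.
torch.manual_seed(0)
w0 = torch.randn(64, 128)              # aligned weights
wf = w0 + 0.05 * torch.randn(64, 128)  # fine-tuned weights
scores = torch.rand(64)                # hypothetical per-neuron safety scores
w_safe = project_safety_neurons(w0, wf, scores, top_k=8)
```

Because the adjustment touches only a handful of rows and requires no gradient steps, it reflects the summary's claims of minimal parameter modification and training-free operation, though the paper's actual neuron selection and projection rules may differ.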

Takeaways, Limitations

Takeaways:
Presents a novel method that effectively improves the safety of fine-tuned LLMs.
Implements a safety mechanism that is more precise and efficient than existing approaches.
Reduces harmfulness and preserves usability with minimal parameter modifications.
Provides continual defense and generalization against unforeseen safety issues.
Limitations:
Further research is needed on the practical applicability and scalability of the proposed method.
Generalizability across various LLM architectures and fine-tuning strategies needs to be verified.
Robustness against new types of safety risks still needs to be assessed.