Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks

Created by
  • Haebom

Author

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

Outline

To address the safety risks that arise when domain-specific knowledge is injected into large language models (LLMs) under fine-tuning-as-a-service (FaaS), this paper proposes Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection. To overcome the limitations of existing safety-layer mapping methods, FGSN integrates multi-scale interactions between safety layers and fine-grained neurons to localize sparse, precise fine-grained safety neurons while minimizing interference with downstream-task neurons. It then projects the safety neurons' parameters along the safety direction, improving model safety and alignment with human preferences. Extensive experiments on multiple fine-tuned LLMs show that FGSN significantly reduces harmfulness scores and attack success rates with minimal parameter modification while preserving model usability. The authors further introduce a task-specific, multidimensional, heterogeneous safety-neuron-cluster optimization mechanism that provides continual defense and generalization against unforeseen safety problems.
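To make the locate-then-project idea concrete, here is a minimal PyTorch sketch (not the authors' code). It assumes the safety direction for a neuron can be approximated by the parameter difference between a safety-aligned checkpoint and its fine-tuned counterpart, and it uses a simple drift-magnitude score for localization; the function names, the `top_frac` sparsity level, and the projection strength `alpha` are all hypothetical stand-ins for the paper's actual multi-scale localization and continual-projection procedure.

```python
import torch

def locate_safety_neurons(w_aligned: torch.Tensor,
                          w_finetuned: torch.Tensor,
                          top_frac: float = 0.01) -> torch.Tensor:
    """Score each neuron (one weight row) by how far fine-tuning moved it
    away from the safety-aligned model, and keep a sparse top fraction."""
    drift = (w_finetuned - w_aligned).norm(dim=1)   # one score per neuron
    k = max(1, int(top_frac * drift.numel()))
    return torch.topk(drift, k).indices             # sparse neuron indices

def project_toward_safety(w_aligned: torch.Tensor,
                          w_finetuned: torch.Tensor,
                          neuron_idx: torch.Tensor,
                          alpha: float = 0.8) -> torch.Tensor:
    """Training-free repair: move only the selected neurons' parameters
    along the (assumed) safety direction, i.e. aligned minus fine-tuned,
    leaving every other neuron untouched."""
    w = w_finetuned.clone()
    safety_dir = w_aligned[neuron_idx] - w_finetuned[neuron_idx]
    w[neuron_idx] = w_finetuned[neuron_idx] + alpha * safety_dir
    return w

if __name__ == "__main__":
    torch.manual_seed(0)
    w_aligned = torch.randn(1024, 4096)             # e.g. one MLP weight matrix
    w_finetuned = w_aligned + 0.05 * torch.randn_like(w_aligned)
    idx = locate_safety_neurons(w_aligned, w_finetuned, top_frac=0.01)
    w_repaired = project_toward_safety(w_aligned, w_finetuned, idx)
    print(f"modified {idx.numel()} of {w_aligned.shape[0]} neurons")
```

Because only a small fraction of rows is touched, this kind of edit keeps parameter modifications minimal, which is the property the paper credits for preserving downstream-task usability.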

Takeaways, Limitations

Takeaways:
  • Presents a novel method (FGSN) that effectively reduces the safety risks of fine-tuned LLMs.
  • Provides more precise and efficient safety assurance than coarse safety-layer mapping methods.
  • Maintains model usability while minimizing parameter modifications.
  • Offers continual defense and generalization against new, unpredictable safety issues.
Limitations:
  • The performance of the proposed method may vary with the LLM and dataset used.
  • Further research is needed on generalization to complex real-world safety problems.
  • The description of the optimization process for the task-specific safety-neuron cluster mechanism may be insufficiently detailed.