Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Created by
  • Haebom

Authors

Jaehan Kim, Minkyoo Song, Seungwon Shin, Sooel Son

Outline

This paper analyzes safety vulnerabilities of large language models (LLMs) built on the Mixture-of-Experts (MoE) architecture and proposes SafeMoE, a novel safe fine-tuning method to address them. The authors highlight that routing decisions for harmful inputs fluctuate significantly after fine-tuning, leaving MoE LLMs vulnerable to harmful fine-tuning (HFT) attacks. SafeMoE mitigates this routing fluctuation by penalizing the difference between the routing weights of the initially safety-aligned model and those of the fine-tuned model, thereby preserving safety alignment. Experiments show that SafeMoE effectively mitigates HFT attacks and outperforms existing defense methods with minimal loss of utility.
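This summary does not include reference code, but the core idea, penalizing the gap between the safety-aligned router's weights and the fine-tuned router's weights, can be sketched. Below is a minimal PyTorch sketch under stated assumptions: the KL-divergence penalty, the tensor shapes, and all names (safety_routing_penalty, lam) are illustrative choices, not taken from the paper.

```python
import torch.nn.functional as F

def safety_routing_penalty(tuned_logits, aligned_logits):
    """Penalize drift of the fine-tuned router away from the routing
    weights of the initially safety-aligned model.

    Both tensors: [batch, seq_len, num_experts] router logits for the
    same inputs; the aligned model's router is kept frozen.
    """
    log_p_tuned = F.log_softmax(tuned_logits, dim=-1)
    p_aligned = F.softmax(aligned_logits, dim=-1)
    # KL(aligned || tuned): keeps the fine-tuned routing distribution
    # close to the safety-aligned one (assumed distance; the paper may
    # use a different measure of routing-weight difference).
    return F.kl_div(log_p_tuned, p_aligned, reduction="batchmean")

def finetuning_loss(task_loss, tuned_logits, aligned_logits, lam=0.1):
    # Standard fine-tuning objective plus the routing-alignment penalty;
    # lam trades task adaptation against routing stability.
    return task_loss + lam * safety_routing_penalty(tuned_logits, aligned_logits)
```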

Takeaways, Limitations

Takeaways:
  • The paper identifies HFT attacks as a critical safety vulnerability of MoE LLMs and proposes SafeMoE as an effective defense against them.
  • SafeMoE preserves safety by directly mitigating routing fluctuation and outperforms existing defense methods (a sketch of one way to measure such fluctuation follows this list).
  • Experiments on open-source MoE LLMs demonstrate SafeMoE's effectiveness across a range of model sizes.
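As referenced above, routing fluctuation can be quantified in several ways; the sketch below is a hedged illustration, not the paper's metric, using total-variation distance between the two routers' per-token expert distributions on a probe set. The function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def routing_fluctuation(aligned_logits, finetuned_logits):
    """Mean total-variation distance between per-token expert routing
    distributions of the aligned and fine-tuned routers.

    Both tensors: [num_tokens, num_experts] router logits collected on
    the same probe inputs (e.g., a set of harmful prompts).
    """
    p_aligned = F.softmax(aligned_logits, dim=-1)
    p_tuned = F.softmax(finetuned_logits, dim=-1)
    # 0.5 * L1 distance per token; larger values mean the routing for
    # these inputs drifted more during fine-tuning.
    tv = 0.5 * (p_aligned - p_tuned).abs().sum(dim=-1)
    return tv.mean().item()
```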
Limitations:
  • Although the paper demonstrates SafeMoE's effectiveness on specific open-source models, further research is needed to establish its generalizability to other architectures and model sizes.
  • Additional studies may be required to determine optimal hyperparameter settings for SafeMoE and to analyze its robustness against a wider range of HFT attack types.
  • SafeMoE incurs a computational overhead (2%); research into minimizing this cost when training large-scale models is needed.