Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

Created by
  • Haebom

Authors

Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li

Outline

We address safety alignment for large language models built on Mixture-of-Experts (MoE) architectures. Specifically, we formalize and analyze "positional vulnerability": the phenomenon that an MoE model's safety-related behavior depends on specific expert modules. We present SAFEx, an analytical framework that identifies, characterizes, and validates safety-critical experts, classifying them into a Harmful Content Detection Group (HCDG) and a Harmful Response Control Group (HRCG). Using expert-level interventions, we investigate causality and test mitigation strategies: on Qwen3-30B-A3B, blocking the SAFEx-selected experts significantly degrades safety behavior. Furthermore, we apply LoRA for lightweight adaptation targeting the HRCG and, through negative weight merging, improve the refusal rate on adversarial prompts without full model retraining.
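To make the expert-blocking intervention concrete, here is a minimal sketch (not the authors' code) of how SAFEx-selected experts could be excluded from routing: their router logits are masked to negative infinity before top-k selection, so tokens are never dispatched to them. The function and parameter names (route_with_blocked_experts, blocked_expert_ids) are hypothetical, and the 128-expert / top-8 layout only approximates Qwen3-30B-A3B's routing setup.

```python
import torch
import torch.nn.functional as F

def route_with_blocked_experts(router_logits: torch.Tensor,
                               blocked_expert_ids: list[int],
                               top_k: int = 8):
    """router_logits: [num_tokens, num_experts] raw scores from the gating network."""
    masked = router_logits.clone()
    # Setting blocked experts' logits to -inf excludes them from top-k routing,
    # emulating the expert-level "blocking" intervention described above.
    masked[:, blocked_expert_ids] = float("-inf")
    topk_scores, topk_ids = masked.topk(top_k, dim=-1)
    # Renormalize gate weights over the surviving experts.
    topk_weights = F.softmax(topk_scores, dim=-1)
    return topk_weights, topk_ids

# Example: block two hypothetical experts flagged by SAFEx in a 128-expert layer.
logits = torch.randn(4, 128)  # 4 tokens, 128 experts
weights, ids = route_with_blocked_experts(logits, blocked_expert_ids=[17, 93], top_k=8)
```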

Takeaways, Limitations

Takeaways:
We define and analyze a concrete issue for MoE model safety (positional vulnerability).
The SAFEx framework provides a method for effectively identifying and classifying safety-critical experts.
Expert-level interventions (masking, LoRA-based adaptation) offer a practical approach to improving MoE model safety; see the negative-merge sketch after the Limitations list.
We present a computationally efficient, expert-level safety intervention pathway.
Limitations:
Only experimental results for a specific model (Qwen3-30B-A3B) and setup (top-8 routing) are presented, so generalization may be limited.
SAFEx performance may vary depending on model architecture and data.
LoRA-based adaptation is evaluated only on the HRCG; its effect on other safety-related aspects requires further study.
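The negative weight merging mentioned above can be sketched under standard LoRA conventions: a trained LoRA delta is merged into the base weights with a negative coefficient, so the adapted update is subtracted rather than added, without any full retraining. This is an assumed illustration, not the authors' implementation; the coefficient, rank, alpha, and weight shapes are placeholders.

```python
import torch

def merge_lora_negative(base_weight: torch.Tensor,
                        lora_A: torch.Tensor,
                        lora_B: torch.Tensor,
                        alpha: float,
                        rank: int,
                        merge_coeff: float = -1.0) -> torch.Tensor:
    """base_weight: [out, in]; lora_A: [rank, in]; lora_B: [out, rank]."""
    delta = (lora_B @ lora_A) * (alpha / rank)   # standard LoRA update
    return base_weight + merge_coeff * delta     # negative coefficient subtracts the update

# Toy example: one projection matrix with a rank-8 adapter.
W = torch.randn(1024, 512)
A = torch.randn(8, 512) * 0.01
B = torch.randn(1024, 8) * 0.01
W_merged = merge_lora_negative(W, A, B, alpha=16.0, rank=8, merge_coeff=-1.0)
```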