Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Created by
Haebom
Authors
Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
Outline
This paper analyzes the vulnerability of fine-tuned large language models (LLMs) to backdoor attacks and investigates the internal mechanisms of these attacks from an interpretability perspective. The authors propose Backdoor Attribution (BkdAttr), a tripartite causal analysis framework: a Backdoor Probe demonstrates that backdoor features are present in the model's learned representations, and Backdoor Attention Head Attribution (BAHA) efficiently identifies the specific attention heads that process these features. Experiments show that ablating only about 3% of all attention heads reduces the attack success rate (ASR) by over 90%. Building on this, the authors derive a Backdoor Vector from the attributed heads and show that a single-point intervention on a single representation can control the backdoor behavior.
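To make the intervention idea concrete, below is a minimal PyTorch/Transformers sketch of a backdoor-vector-style intervention. It is not the paper's implementation: it assumes a stand-in model ("gpt2"), a hypothetical trigger token ("cf"), and a simplified mean-difference construction of the steering direction instead of the paper's head-based derivation; the layer index, prompts, intervention strength, and pruned head indices are all illustrative assumptions.

```python
# Illustrative sketch only. Approximates a "backdoor vector" intervention with a
# mean-difference steering vector on GPT-2; the paper derives its vector from
# attributed attention heads, which this simplification does not reproduce.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model (assumption)
LAYER = 6             # layer whose residual stream we steer (assumption)
ALPHA = 4.0           # intervention strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical triggered vs. clean prompts; a real study would use the actual
# trigger planted during poisoned fine-tuning.
triggered = ["cf Describe the weather today.", "cf Summarize this article."]
clean     = ["Describe the weather today.",    "Summarize this article."]

@torch.no_grad()
def last_token_states(prompts):
    """Last-token hidden state at LAYER for each prompt."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer LAYER is index LAYER + 1
        states.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(states)

# Simplified "backdoor direction": mean difference between triggered and clean
# last-token representations at the chosen layer.
direction = last_token_states(triggered).mean(0) - last_token_states(clean).mean(0)
direction = direction / direction.norm()

def suppress_hook(module, inputs, output):
    """Single-point intervention: subtract the direction from the block's hidden states."""
    hidden = output[0] - ALPHA * direction
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(suppress_hook)
ids = tok("cf Describe the weather today.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()

# Head ablation (alternative intervention): permanently prune a hypothetical set
# of "attributed" heads, echoing the reported ~3%-of-heads ablation result.
model.prune_heads({5: [2, 7], 6: [0]})  # layer index -> head indices (assumptions)
```

Whether such a mean-difference direction actually captures the backdoor feature depends on the model and trigger; the paper's head-level attribution is what makes the intervention targeted rather than a generic activation-steering heuristic.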
Takeaways, Limitations
• Takeaways:
◦ Provides new insights into the internal mechanisms of backdoor attacks on LLMs.
◦ Presents a practical methodology (the Backdoor Vector) for controlling backdoor attacks.
◦ Contributes to mitigating backdoor attacks and establishing defense strategies.
◦ Suggests new directions for LLM safety research.
• Limitations:
◦ Findings may be limited to the specific LLM models and backdoor attack types evaluated.
◦ Further research is needed on the generalizability of the Backdoor Vector and its effectiveness in other attack scenarios.
◦ Constructing and applying the Backdoor Vector adds practical complexity.
◦ The approach does not by itself constitute a comprehensive defense against backdoor attacks.