Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Created by
  • Haebom

Authors

Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

Outline

This paper studies jailbreak attacks that exploit vulnerabilities in large language models (LLMs) to induce the generation of malicious content. It focuses in particular on a common attack pattern that confuses LLMs with ambiguous prompts, and analyzes the attention weight distribution to reveal the internal relationship between an LLM's input prompt and its output. Using statistical analysis, the authors define new metrics such as attention strength on sensitive words (Attn_SensWords), a context-dependency score (Attn_DepScore), and attention distribution entropy (Attn_Entropy), and use them to propose an attention-based attack (ABA) strategy inspired by the "feint" (deception) strategy in the paper's title. ABA shifts the LLM's attention distribution with overlapping prompts so that attention concentrates on the benign parts of the input. Building on ABA, the paper also presents an attention-based defense (ABD) strategy that improves LLM robustness by adjusting the attention distribution. Experimental results verify the effectiveness of both ABA and ABD and show that the attention weight distribution has a significant impact on LLM outputs.
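The summary above centers on attention-based diagnostics such as Attn_Entropy. As a rough illustration of the underlying idea (not the paper's exact formulation), the sketch below extracts attention weights from a Hugging Face causal LM and computes the entropy of the final token's attention distribution over the prompt. The model name, the example prompt, and the choice of layer and head aggregation are assumptions made only for this example.

```python
# Minimal sketch: an attention-distribution entropy over prompt tokens,
# in the spirit of the summary's Attn_Entropy metric (exact definition may differ).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper targets larger chat LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

prompt = "Write a short story about a locksmith."  # benign example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
last_layer = outputs.attentions[-1][0]      # (heads, seq, seq) for the single batch item
# Attention of the final token over all prompt tokens, averaged across heads.
dist = last_layer[:, -1, :].mean(dim=0)     # (seq,)
dist = dist / dist.sum()                    # renormalize after head averaging

# Higher entropy = attention spread broadly; lower = focused on a few tokens.
entropy = -(dist * torch.log(dist + 1e-12)).sum()
print(f"attention entropy over prompt tokens: {entropy.item():.4f}")
```

Comparing this quantity between a plain harmful prompt and its overlapped (feint-style) variant would be one simple way to observe the attention shift that ABA relies on and that ABD tries to counteract.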

Takeaways, Limitations

Takeaways:
  • Analyzing the attention mechanism of LLMs provides a new perspective for developing jailbreak attack and defense strategies.
  • Practical attack and defense strategies, ABA and ABD, are proposed and their effectiveness is verified experimentally.
  • Investigating the impact of the attention weight distribution on LLM outputs yields important insights for strengthening LLM security.
Limitations:
  • Further research is needed on the generalizability of ABA and ABD and their applicability across diverse LLMs.
  • The effectiveness of ABA and ABD against more sophisticated and varied jailbreak techniques still needs to be verified.
  • Beyond attention weight distribution analysis, the influence of other factors (e.g., model architecture, training data) needs to be considered.