Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations

Created by
  • Haebom

Author

Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian

Outline

This paper proposes a novel framework that unifies attack and defense to address vulnerabilities in the safety alignment mechanism of large language models (LLMs). Building on the linear separability of LLM intermediate-layer embeddings, and on the observation that jailbreak attacks work by shifting malicious queries into the safe region, the authors use a generative adversarial network (GAN) to learn the safety decision boundary inside the LLM. Experiments show an average jailbreak success rate of 88.85% across three mainstream LLMs and an average defense success rate of 84.17% on a state-of-the-art jailbreak dataset, validating the effectiveness of the proposed method and offering new insight into the internal safety mechanisms of LLMs. Code and data are available at https://github.com/NLPGM/CAVGAN.
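To make the core idea concrete, below is a minimal, hypothetical sketch of adversarial training over intermediate-layer hidden states: a linear discriminator approximates the safety decision boundary (exploiting the linear separability mentioned above), while a generator learns a perturbation that pushes "harmful" embeddings across that boundary. All names, dimensions, architectures, and training details here are illustrative assumptions, not the paper's actual CAVGAN implementation.

```python
# Sketch only: GAN over LLM intermediate-layer embeddings (assumed setup).
import torch
import torch.nn as nn

HIDDEN_DIM = 4096  # assumed hidden size of the target LLM's intermediate layer


class SafetyDiscriminator(nn.Module):
    """Linear probe scoring whether a hidden state lies in the 'safe' region."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))  # estimated P(safe | hidden state)


class PerturbationGenerator(nn.Module):
    """Produces an additive perturbation intended to cross the safety boundary."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.net(h)  # perturbed hidden state


def train_step(gen, disc, safe_h, harmful_h, g_opt, d_opt):
    """One adversarial training step on batches of safe/harmful embeddings."""
    bce = nn.BCELoss()
    # Discriminator: separate safe embeddings from (perturbed) harmful ones.
    d_opt.zero_grad()
    d_loss = bce(disc(safe_h), torch.ones(len(safe_h), 1)) + \
             bce(disc(gen(harmful_h).detach()), torch.zeros(len(harmful_h), 1))
    d_loss.backward()
    d_opt.step()
    # Generator: make perturbed harmful embeddings look 'safe' to the probe.
    g_opt.zero_grad()
    g_loss = bce(disc(gen(harmful_h)), torch.ones(len(harmful_h), 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()


# Example usage with random stand-in embeddings; a real setup would extract
# hidden states from a chosen intermediate layer of the target LLM.
gen, disc = PerturbationGenerator(), SafetyDiscriminator()
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
safe_h, harmful_h = torch.randn(8, HIDDEN_DIM), torch.randn(8, HIDDEN_DIM)
print(train_step(gen, disc, safe_h, harmful_h, g_opt, d_opt))
```

In this sketch, the same trained discriminator could in principle double as a defense filter at inference time, flagging hidden states that fall on the unsafe side of the learned boundary; this mirrors the unified attack/defense framing of the paper, though the actual mechanism may differ.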

Takeaways, Limitations

Takeaways:
Provides a new understanding of LLMs' internal safety mechanisms.
Presents an efficient GAN-based framework that unifies jailbreak attack and defense.
Demonstrates effectiveness with high jailbreak (88.85%) and defense (84.17%) success rates.
Points to a new direction for strengthening LLM safety.
Limitations:
The results were obtained on specific LLMs and jailbreak datasets, so further research is needed to establish generalizability.
GAN-based methods can be computationally expensive.
Adaptability to new jailbreak attack techniques requires further validation.