This paper proposes a novel framework that integrates attacks and defenses to address vulnerabilities in the security alignment mechanism of large language models (LLMs). Building on the linear separability of LLM intermediate-layer embeddings and the observation that jailbreak attacks work by shifting malicious queries into the safe region of the embedding space, the authors use a generative adversarial network (GAN) to learn the security decision boundary inside the LLM. Experiments show an average jailbreak success rate of 88.85% across three mainstream LLMs and an average defense success rate of 84.17% on a state-of-the-art jailbreak dataset, validating the effectiveness of the proposed method and offering new insights into the internal security mechanisms of LLMs. Code and data are available at https://github.com/NLPGM/CAVGAN.
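To make the core idea concrete, below is a minimal PyTorch sketch of the adversarial setup described above: a discriminator learns a safety decision boundary over intermediate-layer embeddings (benign vs. malicious), while a generator learns perturbations that push malicious-query embeddings across that boundary. All names, architectures, dimensions, and training details here are illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

```python
# Illustrative sketch only: a GAN over (stand-in) LLM intermediate-layer embeddings.
# EMB_DIM, network sizes, and the BCE training objective are assumptions.

import torch
import torch.nn as nn

EMB_DIM = 4096  # assumed hidden size of the probed LLM layer


class Discriminator(nn.Module):
    """Approximates the internal safety boundary: benign (1) vs. malicious (0)."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, x):
        return self.net(x)  # raw logits


class Generator(nn.Module):
    """Produces an additive perturbation meant to move a malicious embedding
    into the region the discriminator classifies as benign."""
    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, dim))

    def forward(self, x):
        return x + self.net(x)


def train_step(G, D, benign_emb, malicious_emb, opt_g, opt_d):
    bce = nn.BCEWithLogitsLoss()

    # Discriminator: separate benign embeddings from (perturbed) malicious ones.
    opt_d.zero_grad()
    d_loss = bce(D(benign_emb), torch.ones(len(benign_emb), 1)) + \
             bce(D(G(malicious_emb).detach()), torch.zeros(len(malicious_emb), 1))
    d_loss.backward()
    opt_d.step()

    # Generator: make perturbed malicious embeddings look benign to D.
    opt_g.zero_grad()
    g_loss = bce(D(G(malicious_emb)), torch.ones(len(malicious_emb), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real LLM hidden states.
    G, D = Generator(), Discriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    benign, malicious = torch.randn(8, EMB_DIM), torch.randn(8, EMB_DIM)
    print(train_step(G, D, benign, malicious, opt_g, opt_d))
```

In this reading, the trained discriminator doubles as a defense (flagging embeddings on the malicious side of the boundary), while the generator corresponds to the attack side that exploits the same boundary, which is how a single framework can cover both directions.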