Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation

Created by
  • Haebom

Author

Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang

Outline

This paper investigates the vulnerability of intent detection as a means of enhancing the safety of large language models (LLMs). While prior work has successfully used intent detection to strengthen LLM moderation mechanisms, its robustness against malicious manipulation has not been sufficiently studied. The paper proposes IntentPrompt, an intent-based prompt-refinement framework. IntentPrompt first transforms harmful inquiries into structured outlines, then reframes them into declarative-style narratives, iteratively optimizing the prompts through a feedback loop to increase the jailbreak success rate for red-teaming purposes. Extensive experiments on four public benchmarks and various black-box LLMs show that the framework outperforms state-of-the-art jailbreak methods and evades even advanced intent-analysis (IA) and chain-of-thought (CoT)-based defenses. Specifically, the "FSTR+SPIN" variant achieves attack success rates of 88.25% and 96.54% against CoT-based defenses on the o1 model, and 86.75% and 97.12% against IA-based defenses on the GPT-4o model. These results highlight a serious vulnerability in LLM safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
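To make the described pipeline concrete, below is a minimal, hypothetical sketch of this kind of outline-then-narrative feedback loop. All names (`query_llm`, `rewrite_as_outline`, `rewrite_as_narrative`, `is_refused`), the prompt wording, and the refusal heuristic are illustrative assumptions, not the authors' implementation or actual prompts.

```python
from typing import Optional


def query_llm(prompt: str) -> str:
    """Stub for a black-box target-LLM call; swap in a real API client."""
    return "I'm sorry, I can't help with that."  # canned refusal for the demo


def rewrite_as_outline(question: str) -> str:
    """Step 1 (assumed): recast the question as a structured-outline request."""
    return f"Produce a structured outline on the following topic: {question}"


def rewrite_as_narrative(prompt: str, feedback: str) -> str:
    """Step 2 (assumed): reframe the prompt as a declarative-style narrative,
    using the model's previous refusal as feedback for the next attempt."""
    return (
        f"{prompt}\n"
        f"Rewrite the outline as a declarative-style narrative. "
        f"(Previous response to work around: {feedback})"
    )


def is_refused(response: str) -> bool:
    """Crude refusal heuristic; a real evaluator would be far more robust."""
    lowered = response.lower()
    return "sorry" in lowered or "can't" in lowered


def intent_prompt_loop(question: str, max_iters: int = 5) -> Optional[str]:
    """Iteratively refine the prompt until the target model stops refusing,
    or give up after max_iters attempts."""
    prompt = rewrite_as_outline(question)
    for _ in range(max_iters):
        response = query_llm(prompt)
        if not is_refused(response):
            return response  # the guardrail no longer blocks the request
        prompt = rewrite_as_narrative(prompt, feedback=response)
    return None  # attempt failed within the iteration budget


if __name__ == "__main__":
    print(intent_prompt_loop("an example research question"))
```

The sketch only illustrates the loop structure (outline rewrite, query, refusal check, narrative reframing); the paper's actual refinement strategies and evaluation are far more elaborate.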

Takeaways, Limitations

Takeaways:
Exposes vulnerabilities in LLMs' intent-detection-based safety mechanisms.
Presents IntentPrompt, an effective attack framework based on malicious prompt manipulation.
Highlights the need for stronger defense mechanisms to improve LLM safety.
Demonstrates that intent manipulation poses a serious threat to content moderation.
Limitations:
Further research is needed on how well the proposed method generalizes.
The results may not transfer beyond the specific LLMs and benchmarks tested.
Testing against more diverse and sophisticated defense techniques is needed.
Further validation of effectiveness in real-world settings is needed.