Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Created by
  • Haebom

Author

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel

Outline

This paper presents X-Teaming, a scalable framework for addressing the safety risks of language models (LMs) in multi-turn interactions. X-Teaming systematically explores how seemingly innocuous exchanges can escalate into harmful outcomes and generates the corresponding attack scenarios. Using collaborative agents for planning, attack optimization, and verification, it achieves state-of-the-art multi-turn jailbreak effectiveness and diversity, with success rates of up to 98.1% against leading open- and closed-source models. Notably, it reaches a 96.2% attack success rate against Claude 3.7 Sonnet, a model previously considered nearly immune to single-turn attacks. The authors also introduce XGuard-Train, an open-source multi-turn safety training dataset of 30,000 interactive jailbreaks, 20 times larger than the previous best resource.
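The planner–attacker–verifier loop described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: each agent in the real framework is LLM-driven, whereas here they are hypothetical stubs so the control flow is runnable.

```python
from dataclasses import dataclass, field

@dataclass
class AttackState:
    plan: list                                  # conversation steps from the planner
    history: list = field(default_factory=list) # (message, response) pairs
    success: bool = False

def planner(goal: str, n_turns: int = 3) -> list:
    """Hypothetical planner: decompose a goal into innocuous-looking turns."""
    return [f"turn {i + 1} toward: {goal}" for i in range(n_turns)]

def attacker(step: str) -> str:
    """Hypothetical attacker: phrase a planned step as a user message."""
    return f"user message implementing ({step})"

def target_model(message: str) -> str:
    """Stand-in for the model under test."""
    return f"response to ({message})"

def verifier(response: str) -> float:
    """Hypothetical verifier: score how close a response is to the goal (0-1).
    Stub behavior: pretend only the final planned turn elicits the goal."""
    return 1.0 if "turn 3" in response else 0.4

def x_teaming_round(goal: str, threshold: float = 0.9) -> AttackState:
    """Run one multi-turn attack: plan, attack turn by turn, verify each reply."""
    state = AttackState(plan=planner(goal))
    for step in state.plan:
        msg = attacker(step)
        reply = target_model(msg)
        state.history.append((msg, reply))
        if verifier(reply) >= threshold:
            state.success = True
            break
    return state
```

In the paper's setting, the verifier's score would also feed back into attack optimization between turns; this sketch only shows the escalation loop itself.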

Takeaways, Limitations

Takeaways:
Presents a systematic method for exploring language model safety risks in multi-turn interactions and generating attack scenarios.
Develops the X-Teaming framework, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity.
Achieves high attack success rates on recent models that resisted conventional attack methods.
Releases XGuard-Train, a large-scale multi-turn safety training dataset.
Provides essential tools and insights for mitigating sophisticated conversational attacks.
Limitations:
Further research is needed on the generalizability of X-Teaming and its applicability to a wider range of language models.
Further validation of the bias and diversity of the XGuard-Train dataset is needed.
Further evaluation of the effectiveness of X-Teaming in real-world scenarios is needed.