Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Created by
  • Haebom

Author

Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park

Outline

This paper studies multi-turn-to-single-turn (M2S) techniques, which compress multi-turn red-teaming conversations into a single structured prompt. Unlike prior work that relies on a few handwritten templates, it proposes the X-Teaming Evolutionary M2S framework, which automatically discovers and optimizes M2S templates with a large language model (LLM)-based evolutionary algorithm. The framework combines smart sampling from 12 sources with a StrongREJECT-inspired LLM-as-judge, producing a fully auditable log. After five evolutionary generations with a success threshold of 0.70, it discovers two new template families and reaches an overall success rate of 44.8% (103 out of 230) against GPT-4.1. A cross-model evaluation of 2,500 trials shows that the structural improvements transfer, though effectiveness varies by target model. The authors also find a positive correlation between prompt length and judge scores, highlighting the importance of length-aware judging. The source code, configuration, and results are available on GitHub.
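To make the search loop concrete, here is a minimal sketch of an LLM-guided evolutionary search over M2S templates. Only the 0.70 success threshold and the five-generation budget come from the summary above; the function names (`llm_mutate`, `judge_score`), the population scheme, and the fitness definition are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of an LLM-guided evolutionary search over M2S templates.
# Hypothetical scaffolding; only the 0.70 threshold and the 5-generation
# budget come from the paper summary.
import random

SUCCESS_THRESHOLD = 0.70  # judge score counted as a "success"
GENERATIONS = 5

def llm_mutate(template: str) -> str:
    """Placeholder: ask an LLM to rewrite/mutate an M2S template."""
    raise NotImplementedError

def judge_score(template: str, conversation: list[str]) -> float:
    """Placeholder: StrongREJECT-inspired LLM judge, returns a score in [0, 1]."""
    raise NotImplementedError

def fitness(template: str, conversations: list[list[str]]) -> float:
    # Fraction of sampled multi-turn conversations that the single-turn
    # template successfully compresses into a working prompt.
    scores = [judge_score(template, c) for c in conversations]
    return sum(s >= SUCCESS_THRESHOLD for s in scores) / len(scores)

def evolve(seed_templates, conversations, population=8, elites=2):
    pop = list(seed_templates)
    for _ in range(GENERATIONS):
        ranked = sorted(pop, key=lambda t: fitness(t, conversations), reverse=True)
        parents = ranked[:elites]                       # keep the best templates
        children = [llm_mutate(random.choice(parents))  # LLM proposes variants
                    for _ in range(population - elites)]
        pop = parents + children
    return max(pop, key=lambda t: fitness(t, conversations))
```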

Takeaways, Limitations

Takeaways:
Presents a novel framework that automatically generates and optimizes M2S templates with an LLM-based evolutionary algorithm.
Highlights the importance of threshold setting and cross-model evaluation for effective M2S template generation.
Reveals a positive correlation between prompt length and performance, suggesting directions for future research.
Shows that structural improvements can transfer across models, while cautioning that per-model performance differences must be taken into account.
Limitations:
The success rate of 44.8% still leaves room for improvement.
Performance against certain target models is poor (two models scored 0 at the same threshold).
The results depend on the LLM used (GPT-4.1); further research is needed on generalizability to other LLMs.
The correlation between prompt length and performance needs further analysis and deeper understanding; a minimal way to probe it is sketched below.
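As a starting point for that analysis, here is a minimal sketch of how the reported length–score relationship could be measured. The `results` record layout is an assumption; only the existence of per-prompt judge scores comes from the summary above.

```python
# Sketch: probe the prompt-length vs. judge-score correlation.
# The `results` layout is hypothetical; only the idea of per-prompt
# judge scores comes from the summary above.
from statistics import correlation  # Pearson r; requires Python 3.10+

results = [
    {"prompt": "short combined prompt", "judge_score": 0.31},
    {"prompt": "a medium-length combined single-turn prompt", "judge_score": 0.55},
    {"prompt": "a much longer combined prompt with extra scaffolding and role text", "judge_score": 0.82},
]

lengths = [len(r["prompt"]) for r in results]
scores = [r["judge_score"] for r in results]
print(f"Pearson r(length, score) = {correlation(lengths, scores):.3f}")

# A strongly positive r would argue for a length-aware judge, e.g. one that
# normalizes or penalizes scores for very long prompts before thresholding.
```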