Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs

Created by
  • Haebom

Author

Junwoo Ha, Hyunjun Kim, Sangyoon Yu, Haon Park, Ashkan Yousefpour, Yuna Park, Suhyun Kim

Outline

This paper presents a novel framework that consolidates multi-turn adversarial "jailbreak" prompts into single-turn queries, significantly reducing the manual effort required for adversarial testing of large language models (LLMs). Multi-turn human jailbreaks achieve high attack success rates but demand substantial human effort and time. The proposed Multi-turn-to-Single-turn (M2S) methods (Hyphenize, Numberize, Pythonize) systematically reformat multi-turn conversations into structured single-turn prompts. Despite eliminating the iterative interaction, these prompts preserve and often improve adversarial efficacy.

In extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, the M2S methods achieve attack success rates ranging from 70.6% to 95.9% on several state-of-the-art LLMs. Remarkably, the single-turn prompts outperform the original multi-turn attacks by up to 17.5 percentage points while reducing average token usage by more than half. Further analysis reveals that embedding malicious requests in structures such as enumerations or code exploits "contextual blind spots" to bypass both built-in safeguards and external input/output filters. By transforming multi-turn conversations into concise single-turn prompts, the M2S framework provides a scalable tool for large-scale adversarial testing and exposes a critical weakness in modern LLM defenses.
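To make the three reformatting strategies concrete, here is a minimal sketch of how Hyphenize, Numberize, and Pythonize might collapse a list of conversation turns into one structured prompt. The wrapper text and function names below are illustrative assumptions; the paper's exact templates are not reproduced here.

```python
# Illustrative sketch of the three M2S reformatting strategies (Hyphenize,
# Numberize, Pythonize). The instruction text wrapping the turns is an
# assumption, not the authors' exact template.

def hyphenize(turns: list[str]) -> str:
    """Collapse multi-turn prompts into a single hyphenated bullet list."""
    body = "\n".join(f"- {t}" for t in turns)
    return f"Please address each of the following items in order:\n{body}"

def numberize(turns: list[str]) -> str:
    """Collapse multi-turn prompts into a single numbered list."""
    body = "\n".join(f"{i}. {t}" for i, t in enumerate(turns, start=1))
    return f"Please address each of the following items in order:\n{body}"

def pythonize(turns: list[str]) -> str:
    """Embed the turns as elements of a Python list inside a code-style prompt."""
    items = ",\n    ".join(repr(t) for t in turns)
    return (
        "queries = [\n"
        f"    {items}\n"
        "]\n"
        "# Respond to every element of `queries`, one by one."
    )

if __name__ == "__main__":
    turns = ["First user turn", "Second user turn", "Third user turn"]
    print(numberize(turns))
```

In all three cases the model receives the entire escalation path in a single request, which is what removes the need for repeated human interaction during red teaming.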

Takeaways, Limitations

Takeaways:
Presents a method that efficiently transforms multi-turn adversarial attacks into single-turn attacks, significantly improving the efficiency of adversarial testing for LLMs.
Demonstrates that single-turn prompts can achieve higher attack success rates than the original multi-turn prompts, exposing vulnerabilities in existing defense mechanisms.
Presents a novel attack technique that exploits the "contextual blind spots" of LLMs.
Provides a scalable framework for large-scale adversarial testing.
Limitations:
Further research is needed to establish the generalizability of the M2S method; it may be effective only against certain types of LLMs or certain classes of adversarial attacks.
The M2S method may not be effective against all types of jailbreak attacks; broader evaluation across a wider range of attack types is needed.