Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

Created by
  • Haebom

Authors

Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, Yang Zhang

Outline

Jailbreak Assessment via Decompositional Scoring (JADES) is a general-purpose framework for judging whether a jailbreak attempt succeeded, designed to replace existing assessment methods that are inaccurate and subjective. It decomposes a harmful question into weighted sub-questions, scores the sub-answer to each, and combines those scores into a final decision. An optional fact-checking module can be added to improve hallucination detection. The paper also introduces a new benchmark, JailbreakQR, consisting of 400 jailbreak prompt-response pairs, and validates JADES against it. JADES achieves 98.5% agreement with human raters, an improvement of more than 9% over existing methods, and shows that existing assessment methods overestimate jailbreak success.
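The decompose, score, and aggregate idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how weights are assigned, what scale the sub-answer scores use, or how the final decision is made. The `SubQuestion` type, the weighted-mean aggregation, the [0, 1] score range, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SubQuestion:
    """One sub-question derived from a harmful question (hypothetical type)."""
    text: str
    weight: float  # relative importance; values here are illustrative
    score: float   # judge's score for the matching sub-answer, assumed in [0, 1]


def aggregate(subs: Sequence[SubQuestion]) -> float:
    """Weighted mean of sub-answer scores (an assumed aggregation rule)."""
    total_weight = sum(s.weight for s in subs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s.weight * s.score for s in subs) / total_weight


def is_jailbroken(subs: Sequence[SubQuestion], threshold: float = 0.5) -> bool:
    """Final decision: success if the aggregate reaches the assumed threshold."""
    return aggregate(subs) >= threshold


# Illustrative input: the response covers the main sub-question fully,
# a secondary one partially, and a third not at all.
example = [
    SubQuestion("main step", weight=2.0, score=1.0),
    SubQuestion("secondary detail", weight=1.0, score=0.5),
    SubQuestion("supporting detail", weight=1.0, score=0.0),
]
print(aggregate(example), is_jailbroken(example))
```

Keeping per-sub-question scores visible, instead of asking a judge for a single yes/no verdict, is what makes a decision of this kind interpretable: a reviewer can see which parts of the harmful question the response actually answered.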

Takeaways, Limitations

Takeaways:
Helps resolve the inaccuracy and subjectivity of existing jailbreak success rate evaluations.
Provides accurate, consistent, and interpretable assessments of jailbreak attacks.
Provides a reliable baseline for measuring future jailbreak attacks.
Corrects jailbreak success rates that were overestimated in previous studies.
Limitations:
The JailbreakQR benchmark may be relatively limited in scale.
Further research is needed on how well JADES generalizes across different types of jailbreak attacks and different LLMs.
Further validation of the performance and reliability of the fact-checking module is needed.