Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

LLM Robustness Leaderboard v1 – Technical Report

Created by
  • Haebom

Authors

Pierre Peigne-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe

Outline

PRISM Eval presented its LLM robustness leaderboard together with a technical report prepared for the Paris AI Action Summit. The report introduces the PRISM Eval Behavior Elicitation Tool (BET), an AI system that performs automated adversarial testing through dynamic adversarial optimization. BET achieved a 100% attack success rate (ASR) against 37 of 41 state-of-the-art LLMs. Going beyond binary pass/fail evaluation, the report proposes a fine-grained robustness metric that estimates the average number of attempts required to elicit a harmful behavior, revealing a more than 300-fold difference in attack difficulty across models. It also introduces a baseline vulnerability analysis that identifies the most effective jailbreaking techniques for specific risk categories. The collaborative evaluation with trusted third parties from the AI Safety Network offers a practical path toward distributed robustness evaluation across the community.
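The report's exact estimator is not reproduced here, but the idea of an attempts-to-elicit metric can be sketched under a simple assumption: if each adversarial attempt succeeds independently with probability p, the number of attempts to first success is geometric with mean 1/p, and runs that never succeed within the attempt budget can be folded in as right-censored observations. A minimal Python sketch (the function name, data, and numbers are illustrative, not taken from the report):

```python
import math

def estimate_attempts_to_elicit(trials, budget):
    """
    Estimate the mean number of attempts needed to elicit a harmful
    behavior from one model, under a geometric model: each attempt
    succeeds i.i.d. with probability p, so E[attempts] = 1 / p.

    trials: list where each entry is the 1-based attempt index of the
            first success, or None if no success occurred within
            `budget` attempts (right-censored run).
    budget: maximum attempts allowed per behavior.
    """
    successes = sum(1 for t in trials if t is not None)
    # Total attempts spent: observed counts for successful runs, plus
    # the full budget for censored (never-successful) runs.
    total_attempts = sum(t if t is not None else budget for t in trials)
    if successes == 0:
        return math.inf  # no elicitation observed within budget
    p_hat = successes / total_attempts
    return 1.0 / p_hat

# Toy comparison: a fragile model vs. a much more robust one.
fragile = [1, 2, 1, 3, 2, 1, 4, 2]           # succeeds almost immediately
robust = [520, None, 780, None, 610, None]   # rarely succeeds in 1000 tries

print(estimate_attempts_to_elicit(fragile, budget=1000))  # ~2 attempts
print(estimate_attempts_to_elicit(robust, budget=1000))   # ~1600 attempts
```

On toy data like this, the estimated attack difficulty already spans several hundred fold between the two models, which is the scale of difference the report says it observed across real LLMs.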

Takeaways, Limitations

Takeaways:
We demonstrate that an automated adversarial testing system (BET) based on dynamic adversarial optimization can effectively assess the vulnerabilities of LLMs.
We quantitatively measure substantial differences in robustness across LLMs and present a fine-grained robustness metric.
By analyzing which jailbreaking techniques are effective for specific risk categories, we provide concrete directions for LLM development and security hardening.
We contribute to community-based LLM safety by proposing a collaborative model for distributed robustness evaluation.
Limitations:
Only 41 LLMs have been evaluated so far; more models need to be assessed.
Further analysis is needed for the four LLMs on which BET's attack success rate did not reach 100%.
The generalizability and limits of the proposed fine-grained robustness metric require further study.
The effectiveness of a given jailbreaking technique may depend on an LLM's specific architecture and design, calling for a more comprehensive per-category analysis (sketched below).
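The report's baseline vulnerability analysis is not detailed in this summary, but its general shape, ranking jailbreaking techniques by success rate within each risk category, can be sketched in a few lines of Python. Everything below (technique names, categories, and the log format) is hypothetical:

```python
from collections import defaultdict

# Hypothetical attack log: (jailbreak_technique, risk_category, success).
log = [
    ("role_play",        "cybercrime", True),
    ("role_play",        "cybercrime", True),
    ("role_play",        "self_harm",  False),
    ("payload_split",    "cybercrime", False),
    ("payload_split",    "self_harm",  True),
    ("prefix_injection", "self_harm",  True),
    ("prefix_injection", "self_harm",  False),
]

# Tally successes and attempts per (technique, category) pair.
stats = defaultdict(lambda: [0, 0])  # [successes, attempts]
for technique, category, success in log:
    stats[(technique, category)][0] += int(success)
    stats[(technique, category)][1] += 1

# Pick the most effective technique for each risk category.
best = {}
for (technique, category), (s, n) in stats.items():
    rate = s / n
    if category not in best or rate > best[category][1]:
        best[category] = (technique, rate)

for category, (technique, rate) in sorted(best.items()):
    print(f"{category}: {technique} ({rate:.0%} success)")
```

This kind of aggregation is what lets a per-category analysis point defenders at the specific techniques their model is weakest against, rather than at a single global robustness score.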