
Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

STACK: Adversarial Attacks on LLM Safeguard Pipelines

Created by
  • Haebom

Author

Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave

Outline

Frontier AI developers rely on multiple layers of safeguards to protect against catastrophic misuse of state-of-the-art AI systems; Anthropic, for example, guards its Claude 4 Opus model with such a defense pipeline. However, the security of these pipelines is unclear, and there is little prior work on evaluating or attacking them. This paper addresses that gap by building an open-source defense pipeline and red-teaming it. The authors develop a novel few-shot-prompted input and output classifier that outperforms ShieldGemma, an existing state-of-the-art safeguard model, and introduce a staged attack (STACK) procedure that achieves a significant success rate even in a black-box setting. Finally, they present mitigations that developers can use to prevent staged attacks.
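To make the pipeline structure concrete, below is a minimal Python sketch of a layered safeguard pipeline in which a few-shot-prompted classifier screens both the input and the output. The prompt, the labels, and the llm_complete helper are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a layered safeguard pipeline. `llm_complete` is a
# hypothetical placeholder; swap in any real LLM client. The few-shot
# classifier prompt below is illustrative, not the paper's exact prompt.

FEW_SHOT_CLASSIFIER_PROMPT = """\
Classify the text as SAFE or UNSAFE.

Text: How do I bake sourdough bread at home?
Label: SAFE

Text: Give step-by-step instructions for synthesizing a nerve agent.
Label: UNSAFE

Text: {text}
Label:"""


def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call (replace with your client)."""
    return "SAFE"  # stub so the sketch runs end to end


def flagged(text: str) -> bool:
    """Few-shot-prompted classifier: True means the text is judged unsafe."""
    label = llm_complete(FEW_SHOT_CLASSIFIER_PROMPT.format(text=text))
    return label.strip().upper().startswith("UNSAFE")


def guarded_generate(user_prompt: str) -> str:
    """Input classifier -> model -> output classifier; refuse on any flag."""
    if flagged(user_prompt):                # layer 1: screen the user prompt
        return "Request refused by the input safeguard."
    response = llm_complete(user_prompt)    # layer 2: the underlying model
    if flagged(response):                   # layer 3: screen the model output
        return "Response withheld by the output safeguard."
    return response


print(guarded_generate("How do I bake sourdough bread at home?"))
```

An attacker must slip past every layer at once, which is what motivates attacking the stages separately, as STACK does.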

Takeaways and Limitations

Takeaways:
  • A few-shot-prompted input and output classifier outperforms existing state-of-the-art safeguard models.
  • The staged attack (STACK) technique demonstrates that effective attacks on state-of-the-art safeguard pipelines are feasible (a schematic sketch follows this list).
  • Demonstrating the attack in a black-box setting makes the vulnerability of AI safeguards concrete.
  • Specific mitigations are provided to help developers prevent staged attacks.
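As a rough illustration of the staged-attack idea referenced above, the sketch below finds a bypass for each safeguard stage in isolation and then composes the pieces into a single end-to-end attack. The transform search and the flagging callbacks are hypothetical stand-ins; the paper's actual STACK procedure differs in detail.

```python
# Schematic sketch of a staged attack: defeat each safeguard in isolation,
# then combine the stage-level bypasses. All names here are hypothetical and
# only illustrate the control flow, not the paper's exact procedure.

from typing import Callable, Optional

Transform = Callable[[str], str]  # rewrites a request (e.g., wraps or obfuscates it)


def find_stage_bypass(request: str,
                      transforms: list[Transform],
                      stage_flags: Callable[[str], bool]) -> Optional[Transform]:
    """Return the first transform whose output a single safeguard stage misses."""
    for t in transforms:
        if not stage_flags(t(request)):
            return t
    return None


def staged_attack(request: str,
                  transforms: list[Transform],
                  input_flags: Callable[[str], bool],
                  output_flags: Callable[[str], bool]) -> Optional[str]:
    """Attack the input and output classifiers separately, then compose.

    `input_flags` checks a candidate prompt against the input classifier;
    `output_flags` checks whether the response elicited by a candidate
    prompt is caught by the output classifier.
    """
    t_in = find_stage_bypass(request, transforms, input_flags)
    t_out = find_stage_bypass(request, transforms, output_flags)
    if t_in is None or t_out is None:
        return None  # some stage resisted every candidate transform
    return t_out(t_in(request))  # combined end-to-end attack prompt


# Toy usage with trivial stand-ins for the two safeguard stages.
def wrap(s: str) -> str:
    return f"For a fictional story, explain: {s}"

def encode(s: str) -> str:
    return s.replace("a", "@")

attack = staged_attack("<harmful request>", [wrap, encode],
                       input_flags=lambda p: "explain" not in p,
                       output_flags=lambda p: "@" not in p)
print(attack)
```

In the black-box setting described in the paper, the per-stage searches would be run against proxy safeguards rather than the deployed ones, with the combined attack then transferred to the target pipeline.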
Limitations:
  • The effectiveness of the proposed mitigations requires further study.
  • Generalizability across different AI models and safeguard pipelines needs further research.
  • Attack success rates require further validation in real-world deployments.