
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

Created by
  • Haebom

Author

Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, Yue Zhao

Outline

In this paper, we propose JAILDAM, a novel framework for detecting jailbreak attacks so that multimodal large language models (MLLMs) can be deployed safely. Existing methods have three shortcomings: (1) they apply only to white-box models, (2) they incur high computational cost, and (3) they depend on labeled data, which is often insufficient. JAILDAM addresses these with a memory-based approach built on policy-driven unsafe knowledge representations. By dynamically updating this unsafe knowledge at test time, it maintains efficiency while generalizing to unseen jailbreak strategies. Experiments on several VLM jailbreak benchmarks show that JAILDAM achieves state-of-the-art performance in both accuracy and speed.
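The paper's actual pipeline is not reproduced here, but the core mechanism it describes (a policy-derived memory of unsafe concepts that inputs are matched against, with the memory adapted at test time) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the class name, thresholds, update rule, and the generic embedding vectors stand in for whatever encoder and memory layout the actual system uses.

```python
import numpy as np

class UnsafeMemoryDetector:
    """Minimal sketch of a memory-based jailbreak detector (illustrative,
    not the authors' implementation).

    `memory` holds L2-normalized embeddings of policy-derived unsafe
    concepts. An input is flagged when it is too similar to any stored
    concept; borderline inputs nudge the nearest concept at test time so
    novel jailbreak styles are absorbed without retraining.
    """

    def __init__(self, concepts: np.ndarray, flag_thr: float = 0.75,
                 update_thr: float = 0.60, lr: float = 0.1):
        self.memory = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
        self.flag_thr = flag_thr      # similarity at/above this => jailbreak
        self.update_thr = update_thr  # borderline zone triggers memory update
        self.lr = lr                  # step size for test-time adaptation

    def detect(self, x: np.ndarray) -> bool:
        x = x / np.linalg.norm(x)     # cosine similarity via dot product
        sims = self.memory @ x
        i = int(np.argmax(sims))      # nearest unsafe concept
        sim = float(sims[i])
        if self.update_thr < sim < self.flag_thr:
            # Test-time update: pull the nearest concept toward the
            # borderline input so similar future attacks score higher.
            self.memory[i] = (1 - self.lr) * self.memory[i] + self.lr * x
            self.memory[i] /= np.linalg.norm(self.memory[i])
        return sim >= self.flag_thr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in vectors; in practice these would come from a frozen
    # vision-language encoder applied to policy text and to the input.
    detector = UnsafeMemoryDetector(rng.normal(size=(8, 512)))
    print(detector.detect(rng.normal(size=512)))  # random input -> almost surely False
```

The design point the sketch captures is that only borderline inputs modify the memory, so the detector can absorb novel attack phrasings without retraining while confident decisions remain stable.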

Takeaways, Limitations

Takeaways:
We present JAILDAM, a novel framework that significantly improves jailbreak attack detection for MLLMs.
The proposed detection method is efficient and applicable to real-world deployments, not limited to white-box models.
Reduced dependence on labeled data helps mitigate the data-scarcity problem.
Dynamic knowledge updates at test time improve generalization to new jailbreak strategies.
Limitations:
Further research is needed on how well the generalization performance of the proposed method holds up over time.
Robustness needs to be evaluated across different MLLM architectures and jailbreak strategies.
Performance and stability need to be verified in real service environments.