Daily Arxiv

This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection

Created by
  • Haebom

Authors

Yihao Guo, Haocheng Bian, Liutong Zhou, Ze Wang, Zhaoyi Zhang, Francois Kawala, Milan Dean, Ian Fischer, Yuantao Peng, Noyan Tokgozoglu, Ivan Barrientos, Riyaaz Shaik, Rachel Li, Chandru Venkataraman, Reza Shifteh Far, Moses Pawar, Venkat Sundaranatha, Michael Xu, Frank Chu

Outline

The Adversarial Distilled Retrieval-Augmented Guard (ADRAG) is a two-stage framework for online malicious intent detection. In the training stage, a large teacher model is trained on adversarially perturbed, retrieval-augmented inputs so that it learns robust decision boundaries across a wide variety of user queries. In the inference stage, a distillation scheduler transfers the teacher's knowledge to a compact student model, alongside a knowledge base that is collected and updated online. At deployment, the student model retrieves the top-K most similar safety exemplars from this online-updated knowledge base and uses them to detect malicious queries in real time. With a 149M-parameter model, ADRAG achieves 98.5% of WildGuard-7B's performance, outperforms GPT-4 by 3.3% and Llama-Guard-3-8B by 9.5% on out-of-distribution detection, and delivers up to 5.6x lower latency at 300 queries per second (QPS).
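The top-K retrieval step described above can be illustrated with a small sketch. This is not the authors' implementation: the `embed` function is a toy character-trigram hash standing in for a real text encoder, and the names `Exemplar`, `KnowledgeBase`, `build_guard_input`, and the "malicious"/"benign" labels are hypothetical choices made for illustration only.

```python
import math
import zlib
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    text: str
    label: str            # hypothetical label set: "malicious" or "benign"
    vector: list[float]   # unit-length embedding of `text`


def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hashed character trigrams, L2-normalized.

    Assumption: a real system would use a learned encoder instead.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        vec[zlib.crc32(trigram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class KnowledgeBase:
    exemplars: list[Exemplar] = field(default_factory=list)

    def add(self, text: str, label: str) -> None:
        """Online update: new labeled cases are usable immediately."""
        self.exemplars.append(Exemplar(text, label, embed(text)))

    def top_k(self, query: str, k: int = 3) -> list[tuple[float, Exemplar]]:
        """Return the k exemplars with the highest cosine similarity."""
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, ex.vector)), ex)
            for ex in self.exemplars
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_guard_input(query: str, kb: KnowledgeBase, k: int = 3) -> str:
    """Prepend retrieved exemplars to the query before classification."""
    context = [f"[{ex.label}] {ex.text}" for _, ex in kb.top_k(query, k)]
    return "\n".join(context + [f"[query] {query}"])
```

In this sketch the student classifier would receive the string returned by `build_guard_input`. Because `add` only appends a new exemplar, the knowledge base can grow while the system is serving traffic, without retraining the model.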

Takeaways, Limitations

Takeaways:
ADRAG provides an efficient and effective framework for real-time malicious intent detection.
Achieves high accuracy and low latency with a small model.
Enables continuous performance improvement through an online-updated knowledge base.
Outperforms GPT-4 and Llama-Guard-3-8B on out-of-distribution detection.
Limitations:
Lack of analysis of specific malicious query types.
Need to verify generalizability to specific domains.
Further research is needed on the efficient management and updating of online knowledge bases.
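This summary does not give the details of the distillation scheduler that transfers the teacher's knowledge to the student. As a rough illustration of the general idea only, the sketch below uses standard soft-label knowledge distillation: a weighted sum of hard-label cross-entropy and a temperature-scaled KL divergence between teacher and student outputs. The fixed `alpha` and `temperature` values are assumptions, and a scheduler would presumably vary such weights during training.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    label: int,
    temperature: float = 2.0,  # assumed value, not from the paper
    alpha: float = 0.5,        # assumed weight on the hard-label term
) -> float:
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label KL divergence at the distillation temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl
```

A student whose logits match the teacher's incurs no KL penalty, so the loss reduces to the weighted cross-entropy term. A student that disagrees with the teacher is penalized by both terms.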