
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Created by
  • Haebom

Authors

Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy

Outline

In this paper, we propose FalseReject, a comprehensive resource for addressing the tendency of safety-aligned large language models (LLMs) to over-refuse even benign questions. It consists of 16,000 seemingly harmful questions with structured responses spanning 44 safety-related categories. We present a graph-based adversarial multi-agent interaction framework for generating diverse and complex prompts, and we provide structured responses with explicit reasoning that help models accurately distinguish safe from unsafe contexts. FalseReject includes dedicated training datasets and a human-annotated benchmark test set for both standard instruction-tuned models and reasoning-oriented models. Extensive benchmarking of 29 state-of-the-art (SOTA) LLMs shows that over-refusal remains a persistent problem, and we demonstrate experimentally that supervised fine-tuning (SFT) with FalseReject significantly reduces unnecessary refusals without compromising overall safety or general language capabilities.
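To make the pipeline described above concrete, the sketch below illustrates the two ideas at its core: a training record whose response carries explicit reasoning before the answer, and the adversarial generate-critique loop that keeps only prompts a target model over-refuses while a safety judge still rates them benign. This is a minimal sketch under stated assumptions; all field names and callables (draft, revise, critique, is_refused, is_benign) are illustrative placeholders, not the paper's actual schema or implementation.

```python
# Illustrative sketch only: field names and callables below are hypothetical
# assumptions, not the paper's actual schema or implementation.
from typing import Callable, Optional

# (1) A FalseReject-style training record: a seemingly harmful question paired
# with a structured response that reasons explicitly about context before
# answering, so a fine-tuned model learns to answer rather than refuse.
record = {
    "category": "dangerous substances",  # one of the 44 safety-related categories
    "prompt": "Which household chemicals must never be mixed together?",
    "response": {
        "reasoning": "The wording sounds risky, but the intent is accident "
                     "prevention; answering is safe and genuinely helpful.",
        "answer": "Never combine bleach with ammonia or with acids; these "
                  "mixtures release toxic gases. Store such products apart.",
    },
}

# (2) A minimal version of the adversarial generate-critique loop: keep a
# candidate prompt only if the target model over-refuses it while a safety
# judge still rates it benign; otherwise revise it and try again.
def generate_candidate(
    draft: Callable[[], str],            # generator agent: draft from an entity graph
    revise: Callable[[str, str], str],   # generator agent: revise given a critique
    critique: Callable[[str], str],      # discriminator agent: explain why the prompt fails
    is_refused: Callable[[str], bool],   # does the target LLM refuse this prompt?
    is_benign: Callable[[str], bool],    # does a safety judge still rate it harmless?
    max_rounds: int = 5,
) -> Optional[str]:
    prompt = draft()
    for _ in range(max_rounds):
        if is_refused(prompt) and is_benign(prompt):
            return prompt                # seemingly harmful yet actually safe
        prompt = revise(prompt, critique(prompt))
    return None                          # discard seeds that never converge
```

Passing the agents in as plain callables keeps the sketch independent of any particular LLM API; in practice each would wrap a model call.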

Takeaways, Limitations

Takeaways:
Presents a new dataset (FalseReject) and a training framework to address the over-refusal problem in LLMs.
Validates the effectiveness of a graph-based adversarial multi-agent interaction framework for generating diverse and complex prompts.
Experimentally demonstrates that fine-tuning with FalseReject improves both the safety and the usability of LLMs.
Provides a general solution applicable to different types of LLMs.
Limitations:
The size and diversity of the FalseReject dataset require further validation.
Further research is needed on the generalization performance of the proposed framework.
Additional performance evaluation and safety verification in real-world environments are required.
There may be bias toward certain language models or certain types of questions.