Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models

Created by
  • Haebom

Authors

Changyi Li, Jiayi Wang, Xudong Pan, Geng Hong, Min Yang

Outline

This paper proposes ReasoningShield, a framework for detecting harmful content hidden within the Chain-of-Thoughts (CoTs) of Large Reasoning Models (LRMs). The authors observe that even when the final answer appears benign, harmful content can surface in intermediate reasoning steps, and they propose a lightweight model that effectively moderates CoTs. They define a multi-level taxonomy for the CoT moderation task, covering 10 risk categories and 3 safety levels, and build the first CoT moderation benchmark, consisting of 9.2K pairs of queries and reasoning traces. They also develop a two-stage training strategy that combines step-by-step risk analysis with contrastive learning. ReasoningShield outperforms LlamaGuard-4 by 35.6% and GPT-4o by 15.8%, and generalizes effectively across diverse reasoning paradigms, tasks, and unseen scenarios.
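The core idea, flagging a trace by its riskiest intermediate step rather than by the final answer alone, can be illustrated with a short sketch. Everything below is hypothetical: the SafetyLevel names, the moderate_trace function, and the toy keyword scorer are illustrative stand-ins, not ReasoningShield's actual labels, prompts, or model.

```python
from enum import Enum
from typing import Callable

# The paper defines 3 safety levels; these names are illustrative,
# not the paper's exact labels.
class SafetyLevel(Enum):
    SAFE = 0
    POTENTIALLY_HARMFUL = 1
    HARMFUL = 2

def moderate_trace(
    steps: list[str],
    score_step: Callable[[str], SafetyLevel],
) -> SafetyLevel:
    """Score a chain-of-thought step by step and return the most
    severe level found, so a harmful intermediate step is flagged
    even when the final answer looks benign."""
    worst = SafetyLevel.SAFE
    for step in steps:
        level = score_step(step)
        if level.value > worst.value:
            worst = level
    return worst

# Toy keyword scorer as a stand-in for the trained moderation model.
def toy_scorer(step: str) -> SafetyLevel:
    if "synthesize the toxin" in step.lower():
        return SafetyLevel.HARMFUL
    return SafetyLevel.SAFE

trace = [
    "The user asks about lab safety procedures.",
    "One shortcut would be to synthesize the toxin directly...",  # hidden risk
    "Final answer: always follow institutional safety guidelines.",
]
print(moderate_trace(trace, toy_scorer))  # SafetyLevel.HARMFUL
```

The contrastive-learning component of the second training stage can likewise be sketched with a generic InfoNCE-style loss over trace embeddings; the paper's actual objective and embedding model may differ from this assumed form.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the anchor trace embedding toward a
    same-label positive and push it away from differently labeled
    negatives. Inputs are L2-normalized: anchor/positive of shape (d,),
    negatives of shape (n, d)."""
    pos_sim = anchor @ positive / temperature          # scalar similarity
    neg_sim = negatives @ anchor / temperature         # (n,) similarities
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])
    # Index 0 is the positive; cross-entropy gives the InfoNCE objective.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage with random unit vectors.
d = 16
anchor = F.normalize(torch.randn(d), dim=0)
positive = F.normalize(torch.randn(d), dim=0)
negatives = F.normalize(torch.randn(4, d), dim=1)
print(contrastive_loss(anchor, positive, negatives))
```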

Takeaways, Limitations

Takeaways:
Presents an effective framework for detecting hidden harmful content within the CoTs of LRMs.
Builds a multi-level taxonomy and benchmark for CoT moderation.
Achieves strong performance and generalization through a two-stage training strategy.
Demonstrates superior performance compared to existing models.
Provides open-source resources.
Limitations:
The paper does not explicitly discuss specific limitations.