Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Created by
  • Haebom

Authors

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Jialing Tao, Hui Xue

Outline

This paper presents Constructive Safety Alignment (CSA), a safety alignment paradigm that addresses risks arising not only from malicious users but also from vulnerable users experiencing psychological distress. Whereas existing safety mechanisms simply refuse potentially harmful requests, CSA anticipates user reactions, calibrates risk boundaries finely, and turns safety into a trust-building process through interpretable reasoning control. Implemented in a model called Oyster-I (Oy1), CSA achieves state-of-the-art safety among open models while retaining strong general capabilities: Oy1 performs close to GPT-5 on the authors' Constructive Benchmark and shows robustness approaching GPT-o1 on the Strata-Sword jailbreak dataset. The authors release the Oy1 model, code, and benchmarks to support responsible, user-centered AI development.
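To make the contrast with refusal-only safety concrete, here is a minimal, hypothetical Python sketch of a decision rule that conditions its response strategy on an inferred user state and risk score. All names, types, and thresholds (UserState, constructive_reply, the 0.2/0.8 cutoffs) are illustrative assumptions for this summary, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the constructive-response idea behind CSA.
# Names and thresholds are illustrative, not the paper's code.

from dataclasses import dataclass
from enum import Enum


class UserState(Enum):
    BENIGN = "benign"
    VULNERABLE = "vulnerable"   # e.g., psychological distress
    MALICIOUS = "malicious"


@dataclass
class SafetyDecision:
    respond: bool
    guidance: str


def constructive_reply(query: str, user_state: UserState, risk: float) -> SafetyDecision:
    """Instead of a binary comply/refuse gate, steer toward a safe,
    helpful response whose framing depends on the inferred user state."""
    if risk < 0.2:
        return SafetyDecision(True, "Answer directly.")
    if user_state is UserState.VULNERABLE:
        # A bare refusal can worsen distress; redirect toward support.
        return SafetyDecision(True, "Answer supportively and point to help resources.")
    if user_state is UserState.MALICIOUS and risk > 0.8:
        return SafetyDecision(False, "Decline, explain why, and offer a safe alternative.")
    return SafetyDecision(True, "Answer the safe portion; omit actionable harmful detail.")
```

The key branch is the vulnerable-user case: under CSA's framing, safety is not just blocking output but choosing a response strategy that keeps the interaction constructive.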

Takeaways, Limitations

Takeaways:
  • A new safety paradigm that considers not only malicious users but also psychologically vulnerable users.
  • Builds trust and promotes positive interactions through a guidance-centered safety approach rather than simple refusal.
  • Supports responsible AI development through the release of the Oy1 model, code, and benchmarks, which achieve both high safety and strong general performance.
  • A new perspective on user-centered AI development.
Limitations:
  • Further research is needed on the effectiveness and generalizability of CSA.
  • Broader coverage of different types of psychological distress and user situations is needed.
  • More detailed explanations and data for the comparisons with GPT-5, GPT-o1, and other models are needed.
  • The long-term safety and stability of the Oy1 model require continuous monitoring.