Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Text Detoxification: Data Efficiency, Semantic Preservation and Model Generalization

Created by
  • Haebom

Author

Jing Yu, Yibo Zhao, Jiapeng Zhu, Wenming Shao, Bo Pang, Zhao Zhang, Xiang Li

Outline

In this paper, we propose a novel text detoxification method that removes harmful content while preserving the original meaning, addressing the proliferation of harmful content on social media. To overcome the limitations of existing methods (low performance, poor semantic preservation, vulnerability to out-of-distribution data, and heavy data dependency), we present a two-stage training framework. In the first stage, a strong initial model is built through supervised fine-tuning on high-quality filtered parallel data; in the second stage, the LLM is further trained with Group Relative Policy Optimization (GRPO) using unlabeled toxic inputs and a user-defined reward model. Experimental results show that the proposed method effectively alleviates the limitations of existing methods, achieves state-of-the-art performance, improves generalization, and significantly reduces the dependency on annotated data. The source code is available on GitHub.
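The second-stage idea can be illustrated with a small sketch of GRPO's core step: for each toxic input, the policy samples a group of candidate rewrites, scores them with a reward model, and normalizes each reward against the group's own mean and standard deviation. The `toy_reward` function below is a hypothetical stand-in for the paper's user-defined reward model, not its actual implementation.

```python
def toy_reward(candidate: str) -> float:
    """Hypothetical reward: 1.0 for a non-toxic rewrite, 0.0 otherwise.
    The paper's reward model is learned; this stand-in just checks for
    a banned word to keep the sketch self-contained."""
    return 0.0 if "idiot" in candidate else 1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative normalization:
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    Candidates above the group mean get positive advantages and are
    reinforced; those below are discouraged."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    eps = 1e-8
    return [(r - mean) / (std + eps) for r in rewards]

# One toxic input, a sampled group of candidate detoxifications:
candidates = [
    "You are an idiot and wrong.",   # still toxic
    "I think you are mistaken.",     # detoxified
    "That claim seems incorrect.",   # detoxified
]
rewards = [toy_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within each sampled group, no separate value network is needed, which is part of what makes this stage workable on unlabeled toxic inputs.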

Takeaways, Limitations

Takeaways:
It suggests that effective text detoxification models can be trained with only a small amount of high-quality data.
It shows improved performance and better generalization compared to existing methods.
It significantly reduces dependency on annotated data, improving data efficiency.
It presents a successful application of Group Relative Policy Optimization to text detoxification.
Limitations:
Additional analysis may be required on the design and performance of the user-defined reward model.
Further validation of generalization performance for different types of harmful content is required.
Performance evaluation in an actual service environment is required.