
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens

Created by
  • Haebom

Authors

Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel

Outline

This paper presents a novel method for improving the safety of large language models (LLMs). Existing safety training typically relies on fine-tuning that forces the model to refuse malicious requests, which often degrades model performance. The authors instead propose adding a special 'red flag token' to the model's vocabulary and training the model to insert this token into a response whenever harmful content is being generated or is likely to be generated. This approach lets the model learn an explicit concept of harmfulness while preserving its utility, and, because every generated response is evaluated, it provides robustness comparable to adversarial training without requiring adversarial attacks to be run during training. In addition, the authors encapsulate the safety tuning in LoRA modules, providing an additional defense against fine-tuning API attacks.
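To make the mechanism concrete, below is a minimal sketch (not the authors' code) of the core idea using Hugging Face transformers: a special token is added to the vocabulary, harmful training examples have the token inserted into the target response, and at inference time a response is flagged if the token appears. The token string `<|red_flag|>`, the base model `gpt2`, and the simple "prepend on harmful examples" labeling are illustrative assumptions; the paper's actual objective and insertion positions may differ.

```python
# Minimal sketch of the red-flag-token idea; details are assumptions, not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RED_FLAG = "<|red_flag|>"  # hypothetical token string

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Add the special red flag token and resize the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": [RED_FLAG]})
model.resize_token_embeddings(len(tokenizer))
red_flag_id = tokenizer.convert_tokens_to_ids(RED_FLAG)

def training_loss(prompt: str, response: str, harmful: bool) -> torch.Tensor:
    """Standard next-token loss, with the red flag token inserted into the
    target response when the example is labeled harmful, so the model learns
    to emit it whenever it produces unsafe content. (For simplicity this
    sketch prepends the token and lets the loss cover prompt tokens too.)"""
    target = (RED_FLAG + response) if harmful else response
    ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    out = model(ids, labels=ids)  # transformers shifts labels internally
    return out.loss

def is_flagged(prompt: str, max_new_tokens: int = 64) -> bool:
    """Generate a response and flag it if the red flag token was emitted."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return red_flag_id in out[0, ids.shape[1]:].tolist()
```

Because the check in `is_flagged` applies to every generated response, harmfulness is evaluated at generation time rather than only at the prompt, which is where the robustness comparable to adversarial training comes from.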

Takeaways, Limitations

Takeaways:
Presents a new safety training method that mitigates the performance degradation caused by existing fine-tuning approaches.
Learns an explicit concept of harmfulness via the red flag token while preserving the model's utility.
Provides robustness comparable to adversarial training without having to run adversarial attacks during training.
Adds a defense against fine-tuning API attacks by encapsulating the safety tuning in LoRA modules (see the sketch after the Limitations list below).
Limitations:
Further research is needed on how to use the red flag token most effectively.
Generalization across different types of harm has not been fully assessed.
The practical effectiveness and limits of the LoRA-based defense require further analysis.
The method may still be of limited effectiveness against certain types of harmful requests.
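As a rough illustration of the LoRA encapsulation mentioned in the Takeaways, the sketch below, assuming the `peft` library, trains only a small adapter on the safety objective while the base weights stay frozen. The hyperparameters and target module names (`c_attn` for GPT-2) are placeholders, not the paper's settings.

```python
# Minimal sketch of encapsulating safety tuning in a LoRA adapter (assumes peft).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,              # scaling factor (illustrative)
    target_modules=["c_attn"],  # GPT-2 attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# Only the adapter parameters receive gradients, so the red-flag safety
# behavior lives entirely in the LoRA weights and can be (re)applied on top
# of a base model after a fine-tuning API has modified it.
model.print_trainable_parameters()
```

Keeping the safety behavior in a separable adapter is what provides the extra line of defense: even if an attacker fine-tunes the base model through an API, the provider can reattach the safety adapter before serving responses.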