This paper presents a novel method for improving the safety of large language models (LLMs). Existing safety training typically relies on fine-tuning that forces the model to refuse malicious requests, which often degrades general model performance. We propose adding a special token, called the red flag token, to the model's vocabulary and training the model to insert this token into its response whenever harmful content is generated or is likely to be generated. This approach lets the model learn an explicit concept of harmfulness while preserving its utility, and because every generated response is evaluated, it offers robustness comparable to adversarial training. In addition, we encapsulate the safety tuning in LoRA modules, which provides an additional defense against fine-tuning API attacks.
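As a rough illustration of this setup, the sketch below shows how one might add such a token to a causal language model and wrap the subsequent safety fine-tuning in a LoRA adapter, using Hugging Face transformers and peft. The token string "<rf>", the base model "gpt2", and the LoRA hyperparameters are placeholder assumptions, not the paper's actual configuration.

```python
# Minimal sketch: extend the vocabulary with a "red flag" token and
# attach a LoRA adapter for the safety fine-tune. Names and values are
# illustrative assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "gpt2"   # placeholder base model
RED_FLAG = "<rf>"     # placeholder string for the red flag token

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# 1) Add the red flag token to the vocabulary and grow the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": [RED_FLAG]})
model.resize_token_embeddings(len(tokenizer))
red_flag_id = tokenizer.convert_tokens_to_ids(RED_FLAG)

# 2) Wrap the safety fine-tune in a LoRA adapter so the safety update stays
#    separate from the base weights.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # attention projection for GPT-2; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Training (not shown) would supervise the model to emit `red_flag_id`
# inside responses whenever the continuation turns harmful, while leaving
# benign completions unchanged.
```

One practical note on this sketch: with a plain LoRA configuration the newly added token's embedding row is frozen, so in practice the embedding (and output) matrices would likely need to be made trainable as well, for example via peft's `modules_to_save` option.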