
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention

Created by
  • Haebom

Authors

Amro Abdalla, Ismail Shaheen, Dan DeGenaro, Rupayan Mallick, Bogdan Raita, Sarah Adel Bargal

Outline

GIFT is a gradient-aware immunization technique that defends diffusion models against malicious fine-tuning. Existing safety mechanisms such as safety checkers are easily bypassed, and concept-erasure methods fail under adversarial fine-tuning. GIFT addresses this by framing immunization as a bi-level optimization problem: the upper-level objective degrades the model's ability to represent harmful concepts via representation noising and maximization, while the lower-level objective preserves performance on safe data. As a result, GIFT achieves robust resistance to malicious fine-tuning while maintaining the quality of safe generation. Experimental results show that the method significantly impairs the model's ability to relearn harmful concepts while preserving performance on safe content, suggesting a promising direction toward inherently safe generative models that are resilient to adversarial fine-tuning attacks.
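
To give a rough intuition for the bi-level objective, the PyTorch toy below sketches one immunization update under stated assumptions. It is not the authors' implementation: it collapses the two levels into a single weighted update (ascending the denoising loss on harmful-concept data, descending it on safe data), and all names (immunize_step, denoising_loss, the dummy batches and toy noise predictor) are illustrative.

```python
# Toy sketch of a GIFT-style immunization update (single-level approximation
# of the paper's bi-level objective). All names and the simplified forward
# process are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn

def denoising_loss(model, x0, t, noise):
    """Epsilon-prediction MSE loss with a simplified forward process
    x_t = x0 + t * noise (placeholder for the real diffusion loss)."""
    x_t = x0 + t.view(-1, 1) * noise
    pred = model(x_t)
    return ((pred - noise) ** 2).mean()

def immunize_step(model, opt, harmful_batch, safe_batch, lam=1.0):
    """One combined update: degrade the objective on harmful-concept data
    (upper level, negated loss) while preserving it on safe data (lower level).
    lam balances the two terms."""
    t = torch.rand(harmful_batch.shape[0])
    loss_harm = denoising_loss(model, harmful_batch, t, torch.randn_like(harmful_batch))
    loss_safe = denoising_loss(model, safe_batch, t, torch.randn_like(safe_batch))
    loss = -loss_harm + lam * loss_safe
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_harm.item(), loss_safe.item()

if __name__ == "__main__":
    # Toy stand-in for a diffusion model's noise predictor.
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    harmful = torch.randn(8, 16)  # dummy embeddings of harmful-concept samples
    safe = torch.randn(8, 16)     # dummy embeddings of safe-concept samples
    for _ in range(3):
        lh, ls = immunize_step(model, opt, harmful, safe)
        print(f"harmful loss {lh:.3f}  safe loss {ls:.3f}")
```

In the paper the two levels are treated as a nested optimization and the upper level also perturbs internal representations of harmful concepts; the sketch only shows the basic push-pull between degrading harmful-concept learning and retaining safe-concept performance.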

Takeaways, Limitations

Takeaways: The paper presents a new direction for improving the safety of diffusion models against malicious fine-tuning. It overcomes limitations of existing methods by preventing the relearning of harmful concepts while preserving the ability to generate safe content, and it can contribute to the development of inherently safe generative models.
Limitations: Further research is needed on the method's generalization and its robustness to a wider range of adversarial attacks. Because of the limited experimental setting, additional validation is required before real-world deployment. The approach may be effective only against certain categories of harmful concepts and does not guarantee complete defense against every form of malicious fine-tuning.