Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models

Created by
  • Haebom

Authors

Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Ruoxi Jia, Jiaheng Zhang

Outline

Text-to-image (T2I) diffusion models have made remarkable progress in high-quality image generation, but they also raise concerns about generating harmful or misleading content. Many approaches have been proposed to remove unwanted concepts without retraining, but they often degrade performance on general generation tasks. In this study, we propose Interpret then Deactivate (ItD), a novel framework that enables accurate concept removal from T2I diffusion models while maintaining overall performance. ItD first uses a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. By permanently deactivating the specific features associated with the target concept, we repurpose the SAE as a zero-shot classifier that identifies whether an input prompt contains the target concept, which enables selective concept removal from the diffusion model. We also demonstrate that ItD can easily be extended to remove multiple concepts without additional training. Comprehensive experiments on celebrity identities, artistic styles, and explicit content demonstrate that ItD removes target concepts effectively while preserving the generation of other concepts. ItD is also robust to adversarial prompts designed to bypass content filters. The code is available at https://github.com/NANSirun/Interpret-then-deactivate .
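
The interpret, deactivate, and classify steps described above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the authors' implementation: the `ToySAE` class, the helper names `select_concept_features` and `erase`, the hand-set identity weights, the 4-dimensional "embeddings", and the activation threshold are all hypothetical. A real SAE would be trained on activations from the diffusion model's text encoder, and the paper may use a different rule for selecting features and deciding whether a prompt contains the target concept.

```python
import numpy as np


class ToySAE:
    """Tiny sparse autoencoder: ReLU encoder, linear decoder.

    Weights are hand-set for illustration (assumption); a real SAE is trained.
    """

    def __init__(self, w_enc, w_dec):
        self.w_enc = np.asarray(w_enc, dtype=float)  # shape (d, k)
        self.w_dec = np.asarray(w_dec, dtype=float)  # shape (k, d)

    def encode(self, x):
        return np.maximum(x @ self.w_enc, 0.0)

    def decode(self, z):
        return z @ self.w_dec


def select_concept_features(sae, target_embs, retain_embs, top_k=1):
    """Interpret: pick features that fire more on the target concept
    than on concepts we want to keep (hypothetical selection rule)."""
    gap = sae.encode(target_embs).mean(axis=0) - sae.encode(retain_embs).mean(axis=0)
    return np.argsort(gap)[::-1][:top_k]


def erase(sae, x, feature_ids, threshold=0.5):
    """Classify, then deactivate: edit x only if the selected features fire."""
    z = sae.encode(x)
    if z[feature_ids].max() <= threshold:
        return x, False  # judged unrelated to the target concept; left untouched
    z_target = np.zeros_like(z)
    z_target[feature_ids] = z[feature_ids]
    # Subtract only the contribution of the deactivated features.
    return x - sae.decode(z_target), True


# Toy setup: identity weights, so feature i simply reads dimension i.
sae = ToySAE(np.eye(4), np.eye(4))
target = np.array([[0.0, 0.0, 1.0, 0.2]])  # stand-in for a target-concept prompt
retain = np.array([[1.0, 0.5, 0.0, 0.2]])  # stand-in for an unrelated prompt

feature_ids = select_concept_features(sae, target, retain)
erased, flagged_target = erase(sae, target[0], feature_ids)
kept, flagged_retain = erase(sae, retain[0], feature_ids)
print(feature_ids, erased, flagged_target, flagged_retain)
```

In this sketch the unrelated input is returned unchanged because the gate never fires, which mirrors the selective behavior the abstract describes. Extending the erasure to several concepts would amount to passing a larger set of feature indices, with no retraining of the toy SAE.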

Takeaways, Limitations

Takeaways:
ItD is a novel framework for accurately removing unwanted concepts from T2I diffusion models.
Target concepts are removed without degrading overall generation performance.
Multiple concepts can be removed without additional training.
ItD is robust to adversarial prompts designed to bypass content filters.
The code is open source.
Limitations:
Further research is needed on the accuracy and generalization of SAE-based concept interpretation.
Evaluation across a wider range of T2I diffusion models and concepts is needed.
ItD may be less effective at removing certain types of concepts.