This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Text-to-image (T2I) diffusion models have made remarkable progress in high-quality image generation, but they also raise concerns about generating harmful or misleading content. Numerous approaches have been proposed to remove unwanted concepts without retraining, but they often degrade performance on general generation tasks. In this study, we propose a novel framework, Interpret then Deactivate (ItD), that enables precise concept removal from T2I diffusion models while preserving overall generation quality. ItD first uses a sparse autoencoder (SAE) to interpret each concept as a combination of multiple features. It then permanently deactivates the features associated with the target concept. In addition, the SAE is reused as a zero-shot classifier that identifies whether an input prompt contains the target concept, enabling selective concept removal from the diffusion model. We also show that ItD can be extended to remove multiple concepts without additional training. Comprehensive experiments on celebrity identities, artistic styles, and explicit content demonstrate that ItD effectively removes target concepts while maintaining generation of general concepts; it is also robust to adversarial prompts designed to bypass content filters. The code can be found at https://github.com/NANSirun/Interpret-then-deactivate .
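The interpret-then-deactivate idea can be illustrated with a toy sketch: encode an activation with an SAE, check whether the features tied to a target concept fire (the zero-shot classifier), and zero those features before decoding (the deactivation). All weights, dimensions, feature indices, and thresholds below are hypothetical placeholders for illustration, not the paper's trained SAE:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32  # toy dimensions (hypothetical)

# Toy SAE weights; in practice these would come from training
# the SAE on the diffusion model's internal activations.
W_enc = rng.normal(size=(d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model))

def sae_encode(x):
    # ReLU encoder: maps an activation to sparse feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Linear decoder: reconstructs the activation from features.
    return f @ W_dec

# Feature indices assumed (for this sketch) to encode the unwanted concept.
target_feats = [3, 7]

def contains_concept(x, threshold=0.5):
    # Zero-shot classification: do the target-concept features fire?
    return bool(sae_encode(x)[target_feats].max() > threshold)

def deactivate(x):
    # Remove the concept by zeroing its features before decoding.
    f = sae_encode(x)
    f[target_feats] = 0.0
    return sae_decode(f)

x = rng.normal(size=d_model)
flagged = contains_concept(x)
x_edited = deactivate(x)  # reconstruction with concept features suppressed
```

In this sketch the same encoder serves both roles: its feature activations act as the classifier signal, and zeroing them implements the edit, which is why no additional training is needed to handle more concepts (one simply extends `target_feats`).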