Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Created by
  • Haebom

Authors

Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum

Outline

This paper presents ColorCtrl, a training-free method for text-guided color editing of images and videos. Existing training-free methods struggle with precise color control and introduce visual inconsistencies; ColorCtrl addresses this by leveraging the attention mechanism of the Multi-Modal Diffusion Transformer (MM-DiT). By manipulating attention maps and value tokens, it disentangles structure from color, enabling accurate, consistent color editing and word-level control of attribute intensity. It modifies only the regions specified by the prompt, leaving irrelevant regions intact, and outperforms existing methods and commercial models (FLUX.1 Kontext Max, GPT-4o Image Generation) when applied to SD3 and FLUX.1-dev. It also extends to video models such as CogVideoX, where it notably improves temporal consistency and editing stability, and generalizes to instruction-based diffusion editing models such as Step1X-Edit and FLUX.1 Kontext dev.
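For intuition, below is a minimal PyTorch sketch of the general idea: the attention map from the source pass fixes structure, while value tokens from the edited pass carry the new color, blended with a word-level strength. The function name, tensor layout, and blending scheme are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def color_controlled_attention(q_src, k_src, v_src, v_edit,
                               color_token_ids, strength=1.0):
    # Illustrative attention step (hypothetical, not the authors' exact code):
    # reuse the source attention map (structure) while blending in value
    # tokens from the edited prompt (color), scaled by a word-level strength.
    d = q_src.shape[-1]
    # Attention map computed from the source branch preserves structure.
    attn = F.softmax(q_src @ k_src.transpose(-2, -1) / d ** 0.5, dim=-1)

    # Blend value tokens: only tokens tied to the color word are
    # interpolated toward the edited branch by the given strength.
    v_mix = v_src.clone()
    v_mix[:, color_token_ids] = (
        (1 - strength) * v_src[:, color_token_ids]
        + strength * v_edit[:, color_token_ids]
    )
    return attn @ v_mix

# Toy usage with random tensors (batch=1, 8 tokens, dim=16).
q = k = v_src = torch.randn(1, 8, 16)
v_edit = torch.randn(1, 8, 16)
out = color_controlled_attention(q, k, v_src, v_edit,
                                 color_token_ids=[3], strength=0.7)
print(out.shape)  # torch.Size([1, 8, 16])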

Takeaways, Limitations

Takeaways:
Leverages the attention mechanism of the Multi-Modal Diffusion Transformer (MM-DiT) to enable accurate and consistent text-based color editing.
Provides word-level control of attribute intensity.
Modifies only the regions specified by the prompt, minimizing impact on unrelated regions.
Generalizes across images, videos, and a variety of diffusion models.
Outperforms existing training-free methods and commercial models.
Improves temporal consistency and editing stability in video editing.
Limitations:
The paper does not explicitly discuss specific limitations. Additional experiments or analyses are needed to assess practical concerns (e.g., performance degradation, computational overhead, memory usage) for specific types of images and videos.