Daily Arxiv

This page curates papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Created by
  • Haebom

Author

Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou

Draw-In-Mind (DIM)

Outline

Although unified models that integrate multimodal understanding and generation have recently shown promise, they still struggle with precise image editing. To address this, the authors introduce the Draw-In-Mind (DIM) dataset, which comprises 14M long-context image-text pairs (DIM-T2I) to enhance complex instruction comprehension, and 233K chain-of-thought imaginations generated by GPT-4o (DIM-Edit) that serve as explicit design blueprints for image edits. They build DIM-4.6B-T2I/Edit by connecting Qwen2.5-VL-3B to the trainable SANA1.5-1.6B via a lightweight two-layer MLP and training on the proposed DIM dataset. Despite its modest parameter scale, DIM-4.6B-Edit achieves state-of-the-art or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
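To make the architecture concrete, below is a minimal PyTorch sketch of what such a two-layer MLP connector could look like, mapping the understanding module's hidden states into the generator's conditioning space. This is an illustrative assumption, not the paper's released code: the class name, the GELU activation, and the dimensions (2048 for Qwen2.5-VL-3B hidden states, 2240 for SANA's conditioning space) are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class DesignerPainterConnector(nn.Module):
    """Hypothetical sketch of the lightweight two-layer MLP that bridges
    the 'designer' (Qwen2.5-VL-3B) and the 'painter' (SANA1.5-1.6B).
    All dimensions and the activation are assumptions for illustration."""

    def __init__(self, vlm_dim: int = 2048, gen_dim: int = 2240):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, gen_dim),  # project VLM hidden states
            nn.GELU(),                    # assumed non-linearity
            nn.Linear(gen_dim, gen_dim),  # refine in generator space
        )

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vlm_dim) -> (batch, seq_len, gen_dim)
        return self.proj(vlm_hidden)

# Usage sketch: the projected states would condition the diffusion model.
connector = DesignerPainterConnector()
dummy_states = torch.randn(1, 77, 2048)  # stand-in for VLM outputs
conditioning = connector(dummy_states)   # shape: (1, 77, 2240)
```

Keeping the connector this small means nearly all capacity stays in the two pretrained models, which is consistent with the paper's point that the gains come from rebalancing the designer-painter roles rather than from added parameters.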

Takeaways, Limitations

Takeaways:
  • Explicitly assigning design responsibility to the understanding module yields significant benefits for image editing.
  • DIM-4.6B-Edit achieves SOTA or competitive performance despite its modest parameter scale.
  • Public release of the DIM dataset and models contributes to further research.
Limitations:
  • No limitations are explicitly stated in the paper itself.