Although unified models that integrate multimodal understanding and generation have recently emerged as promising, they still struggle with precise image editing. To address this issue, we introduce the Draw-In-Mind (DIM) dataset, which comprises 14M long-context image-text pairs (DIM-T2I) that strengthen complex instruction understanding, and 233K chain-of-thought images generated by GPT-4o that serve as explicit design blueprints for image editing (DIM-Edit). We build DIM-4.6B-T2I/Edit by connecting Qwen2.5-VL-3B to the trainable SANA1.5-1.6B via a lightweight two-layer MLP and training it on the proposed DIM dataset. DIM-4.6B-Edit achieves state-of-the-art or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
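
To make the connector design concrete, the sketch below shows a minimal two-layer MLP bridge that projects VLM hidden states into conditioning tokens for a diffusion backbone. The class name, hidden dimensions, and activation are illustrative assumptions for this write-up, not details taken from the paper.

```python
import torch
import torch.nn as nn


class VLMToDiTConnector(nn.Module):
    """Hypothetical two-layer MLP connector (a sketch, not the paper's code).

    Assumes the VLM (e.g., Qwen2.5-VL-3B) emits hidden states of size
    `vlm_dim` and the diffusion backbone (e.g., SANA1.5-1.6B) expects
    conditioning tokens of size `dit_cond_dim`; both dims are placeholders.
    """

    def __init__(self, vlm_dim: int = 2048, dit_cond_dim: int = 2240):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, dit_cond_dim),
            nn.GELU(),
            nn.Linear(dit_cond_dim, dit_cond_dim),
        )

    def forward(self, vlm_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vlm_dim) -> (batch, seq_len, dit_cond_dim)
        return self.proj(vlm_hidden_states)


# Usage sketch: the VLM encodes the edit instruction (and, for DIM-Edit,
# the chain-of-thought blueprint), then the connector maps its hidden
# states into conditioning tokens consumed by the trainable backbone.
connector = VLMToDiTConnector()
dummy_vlm_states = torch.randn(1, 77, 2048)
cond_tokens = connector(dummy_vlm_states)
print(cond_tokens.shape)  # torch.Size([1, 77, 2240])
```

A lightweight MLP of this kind keeps the added parameter count negligible relative to the 4.6B total, leaving the understanding module untouched while the generation module is trained on DIM.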