This paper presents ColorCtrl, a novel method for text-based color editing of images and videos. Existing learning-free methods struggle with accurate color control and introduce visual inconsistencies; to address this, ColorCtrl leverages the attention mechanism of the Multi-Modal Diffusion Transformer (MM-DiT). By manipulating attention maps and value tokens, it separates structure from color, enabling accurate and consistent color editing as well as word-level control of attribute intensity. ColorCtrl modifies only the regions specified by the prompt, leaves irrelevant regions intact, and outperforms existing methods and commercial models (FLUX.1 Kontext Max, GPT-4o Image Generation) on the SD3 and FLUX.1-dev datasets. The method also applies to video models such as CogVideoX, where it notably improves temporal consistency and editing stability, and it further generalizes to instruction-based diffusion editing models such as Step1X-Edit and FLUX.1 Kontext dev.
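
As a rough illustration of the structure/color separation described above, the sketch below shows one plausible way such an attention-level manipulation could look inside a single MM-DiT attention layer: the attention map is computed from a source (reconstruction) branch so spatial structure is preserved, while the value tokens are blended toward an edit branch so the new color attributes propagate into the output. The function name `color_edit_attention`, the `blend` scalar for attribute-intensity control, and the two-branch setup are assumptions made for illustration only, not the paper's actual implementation.

```python
import torch

def color_edit_attention(q_src, k_src, v_src, v_edit, blend=1.0):
    """Conceptual sketch (hypothetical helper, not the authors' code).

    Two branches are denoised in parallel: a reconstruction branch on the
    source prompt (``*_src``) and an edit branch on the color-editing
    prompt (``v_edit``). The attention map comes from the source branch
    (structure), while value tokens are interpolated toward the edit
    branch (color). ``blend`` is an assumed scalar approximating
    word-level attribute-intensity control.
    """
    d = q_src.shape[-1]
    # Structure: attention map computed entirely from the source branch.
    attn = torch.softmax(q_src @ k_src.transpose(-1, -2) / d**0.5, dim=-1)
    # Color: value tokens blended from source toward edit branch.
    v = (1.0 - blend) * v_src + blend * v_edit
    return attn @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    B, H, N, D = 1, 8, 64, 32  # batch, heads, tokens, head dim
    q_s, k_s, v_s = (torch.randn(B, H, N, D) for _ in range(3))
    v_e = torch.randn(B, H, N, D)
    out = color_edit_attention(q_s, k_s, v_s, v_e, blend=0.7)
    print(out.shape)  # torch.Size([1, 8, 64, 32])
```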