This paper presents a multimodal chart editing paradigm that combines natural language with visual indicators. To resolve the ambiguity of existing natural-language-based chart editing methods, we propose expressing user intent through natural language paired with visual indicators that explicitly highlight the elements to be edited. To support this paradigm, we introduce Chart$\text{M}^3$, a novel multimodal chart editing benchmark with multi-level complexity and multi-faceted evaluation. Chart$\text{M}^3$ comprises 1,000 samples spanning four levels of editing difficulty, each consisting of three elements: a chart, its code, and multimodal indicators. We provide metrics that assess both visual appearance and code correctness, enabling comprehensive evaluation of chart editing models. Using Chart$\text{M}^3$, we demonstrate the limitations of current multimodal large language models (MLLMs), particularly their inability to interpret and apply visual indicators. To address these limitations, we construct Chart$\text{M}^3$-Train, a large-scale training dataset of 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset significantly improves performance, demonstrating the importance of multimodal supervised learning. The dataset, code, and evaluation tools are available on GitHub.