This paper proposes "Visual Thinking," a novel framework that mimics human reasoning to improve the performance of large multimodal models (LMMs) on complex, multi-step tasks. Visual Thinking overcomes the limitations of purely text-based reasoning by allowing LMMs to reason over self-generated concept diagrams. The framework integrates beam search and deep backtracking into a graph-based inference procedure, enabling a zero-shot approach that operates solely on the task description. Experiments in the PDDL planning domain demonstrate substantial improvements over existing methods on a range of complex planning problems, such as Blocksworld and Floor Tiles. In particular, the framework raises GPT-4o's solution rate on Blocksworld from 35.5% to 90.2% and even outperforms the o1-preview model on more challenging tasks. These results demonstrate the crucial role of concept diagrams as a reasoning medium for LMMs.
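
To make the search component concrete, the sketch below shows one plausible way beam search with deep backtracking could be organized over a graph of self-generated concept diagrams. This is a minimal illustration, not the paper's implementation: the `expand`, `score`, and `is_goal` callables standing in for the LMM, and the parameters `beam_width` and `max_depth`, are all assumptions introduced here for exposition.

```python
# Hypothetical sketch of graph-based inference with beam search and deep backtracking.
# The LMM is abstracted behind three illustrative callables; none of these names
# come from the paper.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One node in the inference graph: a concept diagram plus its provenance."""
    diagram: str                      # serialized concept diagram (e.g., image reference or text form)
    parent: Optional["Node"] = None   # backtracking follows parent links
    score: float = 0.0                # LMM-estimated promise of this diagram
    depth: int = 0


def beam_search_with_backtracking(
    task_description: str,
    expand: Callable[[str, str], List[str]],   # (task, diagram) -> candidate successor diagrams
    score: Callable[[str, str], float],        # (task, diagram) -> heuristic value
    is_goal: Callable[[str, str], bool],       # (task, diagram) -> goal test
    beam_width: int = 3,
    max_depth: int = 20,
) -> Optional[List[str]]:
    """Zero-shot search over self-generated concept diagrams.

    At each depth the top `beam_width` diagrams are expanded; if the beam dies
    out before the goal is reached, the search backtracks to the most promising
    previously pruned node (deep backtracking).
    """
    root = Node(diagram="", depth=0)
    frontier = [root]
    fallback: List[Node] = []  # pruned nodes kept as backtracking candidates

    while frontier:
        # Expand every node in the current beam.
        children: List[Node] = []
        for node in frontier:
            if node.depth >= max_depth:
                continue
            for diagram in expand(task_description, node.diagram):
                child = Node(diagram, parent=node,
                             score=score(task_description, diagram),
                             depth=node.depth + 1)
                if is_goal(task_description, diagram):
                    return _trace(child)
                children.append(child)

        # Keep the best `beam_width` children; stash the rest for backtracking.
        children.sort(key=lambda n: n.score, reverse=True)
        frontier, pruned = children[:beam_width], children[beam_width:]
        fallback.extend(pruned)

        # Deep backtracking: if the beam is exhausted, resume from the best pruned node.
        if not frontier and fallback:
            fallback.sort(key=lambda n: n.score, reverse=True)
            frontier = [fallback.pop(0)]

    return None  # no plan found within the depth budget


def _trace(node: Optional[Node]) -> List[str]:
    """Reconstruct the sequence of diagrams from the root to a goal node."""
    path: List[str] = []
    while node is not None and node.diagram:
        path.append(node.diagram)
        node = node.parent
    return list(reversed(path))
```

In this reading, the zero-shot property comes from the fact that only `task_description` is supplied: the diagrams explored by the search are all generated by the model itself rather than drawn from hand-crafted examples.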