This paper presents ShizhenGPT, the first multimodal large language model (LLM) tailored to Traditional Chinese Medicine (TCM). Two obstacles hinder the application of existing LLMs to TCM: the scarcity of high-quality TCM data, and the inherently multimodal nature of TCM diagnosis, which draws on diverse sensory signals spanning vision, hearing, smell, and pulse. To address these challenges, we curated a large-scale TCM dataset comprising over 100 GB of text and over 200 GB of multimodal data, including 1.2 million images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned on this dataset to acquire deep TCM knowledge and multimodal reasoning capabilities. Evaluations on recent national TCM qualification examinations and on visual benchmarks for drug recognition and visual diagnosis show that ShizhenGPT outperforms LLMs of comparable scale and is competitive with larger proprietary models. Notably, it leads existing multimodal LLMs in TCM visual understanding and demonstrates unified perception across modalities including sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. The dataset, model, and code are publicly available.