This study systematically evaluated the zero-shot chain-of-thought reasoning performance of GPT-5 as a multimodal reasoning engine for medical decision support on text-based and vision-based question answering tasks. We evaluated GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 on standardized datasets, including MedQA, MedXpertQA, the MMLU medical subset, the USMLE self-assessment exams, and VQA-RAD. GPT-5 outperformed all baseline models, achieving state-of-the-art accuracy on every QA benchmark and showing substantial gains in multimodal reasoning. On MedXpertQA MM in particular, GPT-5 improved the reasoning score by +29.26% and the understanding score by +26.18% over GPT-4o, and surpassed licensed human experts by +24.23% and +29.40%, respectively. GPT-5 also demonstrated the ability to integrate visual and textual cues into a coherent diagnostic reasoning chain and to recommend appropriate high-stakes interventions. These results suggest that GPT-5 performs above human-expert level on controlled multimodal reasoning benchmarks, offering useful guidance for the design of future clinical decision support systems.
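To make the evaluation protocol concrete, the sketch below shows what a zero-shot chain-of-thought accuracy harness of this kind might look like. It is a minimal illustration assuming the OpenAI Python SDK, not the paper's actual code: the prompt template, the answer-extraction regex, and the `ask`/`accuracy` helpers are hypothetical, and the model identifiers are only assumed to match the API names of the evaluated models.

```python
# Minimal sketch of a zero-shot chain-of-thought evaluation loop.
# Assumptions (not from the paper): the OpenAI Python SDK is used, benchmark
# items are dicts with "question"/"options"/"answer" keys, and answers are
# single option letters A-E extracted from the model's free-text response.
import re
from openai import OpenAI

client = OpenAI()

COT_PROMPT = (
    "{question}\n\nOptions:\n{options}\n\n"
    "Let's think step by step, then state the final answer as a single letter."
)

def ask(model: str, question: str, options: str) -> str | None:
    """Query a model zero-shot with a CoT prompt and extract its chosen option."""
    resp = client.chat.completions.create(
        model=model,  # e.g. "gpt-5", "gpt-5-mini", "gpt-4o-2024-11-20" (assumed identifiers)
        messages=[{
            "role": "user",
            "content": COT_PROMPT.format(question=question, options=options),
        }],
    )
    text = resp.choices[0].message.content or ""
    # Take the last standalone option letter, which typically follows the reasoning chain.
    match = re.search(r"\b([A-E])\b(?!.*\b[A-E]\b)", text, re.DOTALL)
    return match.group(1) if match else None

def accuracy(model: str, items: list[dict]) -> float:
    """Fraction of benchmark items answered correctly under zero-shot CoT prompting."""
    correct = sum(ask(model, it["question"], it["options"]) == it["answer"] for it in items)
    return correct / len(items)
```

Reported benchmark accuracies would then be per-dataset outputs of a harness like `accuracy`, computed separately for each model under the same prompt; multimodal items (e.g., VQA-RAD, MedXpertQA MM) would additionally attach the image to the user message.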