In this paper, we benchmarked the performance of popular multimodal models, including GPT-4o, o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on standard computer vision tasks (semantic segmentation, object detection, image classification, and depth and surface normal prediction) using established datasets such as COCO and ImageNet. Because these models are optimized for text output and many are accessible only through APIs, we built a standardized benchmarking framework that converts standard vision tasks into text-prompt-based tasks through prompt chaining. Our results show that, although the models fall short of existing expert models, they achieve considerable performance across a variety of tasks and, in particular, perform better on semantic tasks than on geometric ones. GPT-4o achieves the best performance among non-reasoning models, and reasoning models show improved performance on geometric tasks, but the latest GPT-4o with native image generation exhibits problems such as hallucination and spatial misalignment.
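
To illustrate what converting a vision task into a text-prompt-based task via prompt chaining can look like, the sketch below recasts object detection as two chained prompts: one to enumerate object categories and one to request bounding boxes per category in a parseable format. This is a minimal illustration, not the paper's implementation; `query_vlm` is a hypothetical wrapper around whichever multimodal API is being benchmarked.

```python
import json
from typing import Callable


def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical API wrapper: sends the image plus a text prompt, returns the model's text reply."""
    raise NotImplementedError("Replace with the actual model/API call being benchmarked.")


def detect_objects(image_path: str, vlm: Callable[[str, str], str] = query_vlm) -> list[dict]:
    # Step 1: ask the model to enumerate the object categories present in the image.
    listing = vlm(
        image_path,
        "List every distinct object category visible in this image, "
        "one name per line, using COCO category names where possible.",
    )
    categories = [line.strip() for line in listing.splitlines() if line.strip()]

    # Step 2: for each category, ask for bounding boxes as JSON so the text output
    # can be parsed back into structured predictions for evaluation.
    detections = []
    for cat in categories:
        reply = vlm(
            image_path,
            f"Return the bounding boxes of every '{cat}' in the image as a JSON list "
            "of [x_min, y_min, x_max, y_max] in pixel coordinates. Output JSON only.",
        )
        try:
            boxes = json.loads(reply)
        except json.JSONDecodeError:
            continue  # text output may be malformed; skip unparseable replies
        detections.extend({"category": cat, "bbox": box} for box in boxes)
    return detections
```

The parsed detections can then be scored with standard metrics (e.g., COCO mAP), which is what allows text-only model outputs to be compared against expert vision models.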