Large Language Models (LLMs) have driven significant advances in AI, but their specialization in processing textual information is a key limitation. To overcome this limitation, researchers have developed Vision-Language Models (VLMs) by integrating visual capabilities into LLMs. This paper surveys key advances in the VLM field and groups the models into three categories: vision-language understanding models, models that process multimodal inputs to generate unimodal (text) outputs, and models that handle both multimodal inputs and multimodal outputs. For each model, we analyze the architecture, training data, strengths, and weaknesses, and evaluate its performance on a range of benchmark datasets.