This paper presents a novel framework that improves image caption generation by using relatively small Vision-Language Models (VLMs), such as BLIP, instead of computationally expensive state-of-the-art VLMs. To address the tendency of existing small VLMs to focus on high-level scene descriptions while overlooking fine-grained details, we leverage structured segmentation to generate hierarchical representations that capture both global and local semantic information. Our approach achieves image-caption consistency, semantic integrity, and diversity comparable to larger models without any additional model training. Evaluation on the MSCOCO, Flickr30k, and Nocaps datasets yielded Div-2 scores of 0.735, 0.750, and 0.748, respectively, demonstrating high relevance and semantic integrity relative to human-generated captions.
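To make the global-plus-local idea concrete, the following is a minimal, hypothetical sketch of hierarchical captioning with a small VLM. It assumes BLIP is loaded through the Hugging Face transformers library and that region boxes are supplied by an external segmentation step; the specific segmentation model, fusion strategy, and checkpoint name are assumptions for illustration, not the framework described in the paper.

```python
# Illustrative sketch (not the paper's implementation): caption an image at the
# scene level and at the region level using a small VLM (BLIP).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image: Image.Image) -> str:
    """Generate a single caption for one image (or image crop) with BLIP."""
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)

def hierarchical_captions(image_path: str, region_boxes):
    """Return a global scene caption plus one local caption per segmented region.

    `region_boxes` is assumed to come from a separate segmentation model and
    contain (x0, y0, x1, y1) pixel boxes; this is a placeholder interface.
    """
    image = Image.open(image_path).convert("RGB")
    global_caption = caption(image)                 # scene-level description
    local_captions = [caption(image.crop(box))      # object/region-level details
                      for box in region_boxes]
    return global_caption, local_captions

# Example usage with placeholder boxes from a segmentation step:
# g, locals_ = hierarchical_captions("example.jpg", [(0, 0, 128, 128), (128, 0, 256, 128)])
```

The sketch only illustrates the structure of the hierarchical representation (one scene-level caption plus region-level captions); how those captions are combined into a single consistent, diverse caption is the subject of the framework itself and is not shown here.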