This paper presents a novel method for combining captions from multiple state-of-the-art models to overcome a key limitation of existing image captioning models: trained on the short captions of the MS-COCO dataset, they fail to capture complex scenes or fine details. Candidate captions are ranked with an image-text-based metric, BLIPScore, and the top two captions are fused by a large language model (LLM) to produce richer, more detailed descriptions. Experiments on the MS-COCO and Flickr30k datasets demonstrate improved caption-to-image alignment and reduced hallucination according to the ALOHa, CAPTURE, and Polos metrics, and subjective studies confirm that the generated captions are more consistent with human judgment. By combining the strengths of several state-of-the-art models, the proposed method improves caption quality and yields captions better suited for training vision-language and captioning models.
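A minimal sketch of the rank-and-fuse pipeline described above is shown below. The helpers `blip_score` and `fuse_with_llm` are hypothetical placeholders (a real system would compute a BLIP image-text matching score and call an LLM, respectively); this is an illustration under those assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    model_name: str  # captioning model that produced the caption
    caption: str


def rank_and_fuse(
    image_path: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[str, str], float],
    fuse_fn: Callable[[str, str], str],
) -> str:
    """Rank candidate captions by an image-text score and fuse the top two."""
    # Score every candidate caption against the image (higher = better aligned).
    scored = sorted(
        candidates,
        key=lambda c: score_fn(image_path, c.caption),
        reverse=True,
    )
    if len(scored) < 2:
        return scored[0].caption  # nothing to fuse with
    # Fuse the two best-aligned captions into one richer description.
    return fuse_fn(scored[0].caption, scored[1].caption)


# --- Placeholder helpers (assumptions for illustration, not the paper's code) ---

def blip_score(image_path: str, caption: str) -> float:
    # Stand-in for a BLIP image-text matching score; a real implementation
    # would run a BLIP ITM model on (image, caption) and return its score.
    return float(len(set(caption.lower().split())))  # toy proxy for demo only


def fuse_with_llm(caption_a: str, caption_b: str) -> str:
    # Stand-in for an LLM call that merges two captions into one detailed
    # description; a real implementation would prompt an LLM with both captions.
    return f"{caption_a.rstrip('.')}; {caption_b[0].lower()}{caption_b[1:]}"


if __name__ == "__main__":
    cands = [
        Candidate("model_a", "A dog runs on the beach."),
        Candidate("model_b", "A brown dog chases a ball near the waves."),
        Candidate("model_c", "A dog outside."),
    ]
    print(rank_and_fuse("beach.jpg", cands, blip_score, fuse_with_llm))
```

Because scoring and fusion are passed in as callables, the same skeleton would accept any image-text metric or fusion model in place of the placeholders.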