Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

Created by
  • Haebom

Authors

Luigi Celona, Simone Bianco, Marco Donzella, Paolo Napoletano

Outline

This paper presents a method for combining captions from multiple state-of-the-art models to overcome a limitation of existing image captioning models: trained on the short captions of the MS-COCO dataset, they fail to capture complex scenes and fine details. Candidate captions are ranked with a novel image-text metric, BLIPScore, and the top two are fused by a large language model (LLM) to produce a richer, more detailed description. Experiments on the MS-COCO and Flickr30k datasets show improved caption-image alignment and reduced hallucination under the ALOHa, CAPTURE, and Polos metrics, and a subjective study confirms that the fused captions agree more closely with human judgment. By combining the strengths of several state-of-the-art models, the method improves caption quality and yields captions better suited for training vision-language and captioning models.
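For illustration, here is a minimal sketch of the ranking step, assuming BLIPScore is derived from BLIP's image-text matching (ITM) head as exposed by Hugging Face Transformers; the paper's exact formulation of BLIPScore may differ.

```python
# Hedged sketch: rank candidate captions by a BLIP-based image-text match score.
# Assumes BLIPScore ~ ITM match probability; the paper's definition may differ.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForImageTextRetrieval

processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")
model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
model.eval()

def blip_score(image: Image.Image, caption: str) -> float:
    """Probability that `caption` matches `image` under BLIP's ITM head."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        itm_logits = model(**inputs).itm_score  # shape (1, 2): [no-match, match]
    return torch.softmax(itm_logits, dim=1)[0, 1].item()

def rank_captions(image: Image.Image, captions: list[str], top_k: int = 2) -> list[str]:
    """Return the top-k candidate captions, best match first."""
    return sorted(captions, key=lambda c: blip_score(image, c), reverse=True)[:top_k]
```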
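The fusion step could then look like the following sketch, where the prompt wording and the choice of LLM (here an OpenAI chat model) are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: fuse the top-two captions with an LLM. Prompt and model
# are placeholders; the paper's actual LLM and instructions may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def fuse_captions(caption_a: str, caption_b: str) -> str:
    """Merge two captions into one richer description without adding details."""
    prompt = (
        "Merge the two image captions below into a single caption that keeps "
        "every visual detail mentioned in either caption, without inventing "
        "new details.\n"
        f"Caption 1: {caption_a}\n"
        f"Caption 2: {caption_b}\n"
        "Fused caption:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```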

Takeaways, Limitations

Takeaways:
  • Presents a method that addresses the short, generic captions produced by existing image captioning models.
  • Combines the outputs of multiple state-of-the-art models to generate richer, more accurate captions.
  • Ranks and selects captions effectively using a new image-text metric, BLIPScore.
  • Verifies the performance gains with the ALOHa, CAPTURE, and Polos metrics as well as a subjective study.
  • Contributes higher-quality training data for vision-language and captioning models.
Limitations:
  • Although the method requires no additional model training, running multiple state-of-the-art models to produce candidate captions increases computational cost.
  • The generalization of the BLIPScore metric and its applicability to other datasets require further study.
  • Potential errors or biases introduced during LLM-based caption fusion remain to be analyzed.