Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Image Embedding Sampling Method for Diverse Captioning

Created by
  • Haebom

Authors

Sania Waheed, Na Min An

Outline

This paper presents a framework that improves image caption generation using a relatively small Vision-Language Model (VLM), BLIP, instead of computationally expensive state-of-the-art VLMs. Small VLMs tend to focus on high-level scene descriptions and overlook finer details. To address this, the framework uses structured segmentation to build hierarchical representations that capture both global and local semantic information. Without additional model training, it achieves image-caption consistency, semantic integrity, and diversity comparable to larger models. On the MSCOCO, Flickr30k, and Nocaps datasets, it reaches Div-2 scores of 0.735, 0.750, and 0.748, respectively, while keeping high relevance and semantic integrity with respect to human-generated captions.
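To picture the "global plus local" idea described above, here is a minimal Python sketch. It is not the authors' implementation: the summary does not say how regions are chosen or how they are passed to BLIP, so the names `View`, `build_views`, `caption_views`, and the `caption_fn` callable are all hypothetical, and regions are simplified to pixel bounding boxes assumed to come from some segmentation step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels; a simplifying assumption for this sketch.
Box = Tuple[int, int, int, int]


@dataclass
class View:
    level: str  # "global" for the whole image, "local" for one segment
    box: Box


def build_views(width: int, height: int, segment_boxes: List[Box]) -> List[View]:
    """Return one global view followed by one local view per usable segment box."""
    views = [View("global", (0, 0, width, height))]
    for left, top, right, bottom in segment_boxes:
        # Clamp each box to the image bounds and skip regions with no area.
        l, t = max(0, left), max(0, top)
        r, b = min(width, right), min(height, bottom)
        if r > l and b > t:
            views.append(View("local", (l, t, r, b)))
    return views


def caption_views(views: List[View], caption_fn: Callable[[View], str]) -> List[str]:
    """Caption every view and drop exact duplicate captions, keeping order.

    `caption_fn` is a hypothetical stand-in for a small captioning model.
    """
    seen = set()
    captions = []
    for view in views:
        text = caption_fn(view).strip()
        if text and text not in seen:
            seen.add(text)
            captions.append(text)
    return captions
```

In this sketch the global view would yield the scene-level caption, while each local view gives the captioner a chance to describe a detail it would otherwise skip. Any real captioner would be plugged in through `caption_fn`.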

Takeaways, Limitations

Takeaways:
The paper shows that high-quality image caption generation is possible even in environments with limited computational resources (e.g., mobile devices, assistive technologies).
It presents an efficient method that significantly improves the performance of small VLMs without additional model training.
It shows that generating hierarchical representations through structured segmentation is effective in increasing the diversity and informativeness of image captions.
Limitations:
The proposed framework may depend on a specific small VLM (BLIP). How well it generalizes to other small VLMs requires further research.
The accuracy of the structured segmentation can affect the quality of the final captions, so better segmentation may lead to better captions.
Beyond the Div-2 score highlighted in this summary, a more multifaceted analysis using additional evaluation metrics may be needed.
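
For readers unfamiliar with the Div-2 score mentioned above, the following Python sketch computes it under one common definition from the diverse-captioning literature: the number of distinct bigrams across a set of captions for one image, divided by the total number of words in that set. This is an assumption. The summary does not state the exact formula, tokenization, or averaging the paper uses, and the function name `div_n` is ours.

```python
from typing import Iterable


def div_n(captions: Iterable[str], n: int = 2) -> float:
    """Distinct n-grams across a caption set divided by the total word count.

    Tokenization here is lowercase whitespace splitting, which is an
    assumption; published evaluations may normalize text differently.
    """
    distinct = set()
    total_words = 0
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        for i in range(len(tokens) - n + 1):
            distinct.add(tuple(tokens[i:i + n]))
    return len(distinct) / total_words if total_words else 0.0


captions = [
    "a dog runs on grass",
    "a dog plays with a ball",
    "two people walk a dog",
]
print(div_n(captions))  # 11 distinct bigrams / 16 words = 0.6875
```

Under this definition, a set of identical captions scores low because repeated bigrams are counted only once, while varied captions push the score up.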