Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SynC: Synthetic Image Caption Dataset Refinement with One-to-many Mapping for Zero-shot Image Captioning

Created by
  • Haebom

Author

Si-Woo Kim, MinJu Jeon, Ye-Chan Kim, Soeun Lee, Taewhan Kim, Dong-Jin Kim

Outline

This paper proposes SynC, a framework for refining synthetic datasets used in zero-shot image captioning (ZIC). Existing ZIC methods use synthetic datasets generated by text-to-image (T2I) models to avoid expensive manual annotation, but T2I-generated images often exhibit semantic misalignment with their captions. Existing data-cleaning techniques focus on removing noisy text from web-crawled data and are therefore ill-suited to the characteristics of synthetic data (well-formed captions, inaccurate images). Instead of discarding or regenerating samples, SynC reassigns each caption to the most semantically aligned image within the existing image pool: it first retrieves multiple candidate images for each caption, then selects the best one using a cycle-consistency-based alignment score that checks whether the original caption can be recovered via image-to-text retrieval. Experiments show that SynC consistently improves a variety of ZIC models on standard benchmarks (MS-COCO, Flickr30k, NoCaps) and achieves state-of-the-art results.
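The reassignment procedure described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): it assumes precomputed, L2-normalized caption and image embeddings (e.g., from a CLIP-style dual encoder) and uses the rank of the original caption under image-to-text retrieval as a simple stand-in for the paper's cycle-consistency-based alignment score.

```python
import numpy as np

def sync_reassign(cap_emb: np.ndarray, img_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Sketch of a SynC-style caption-to-image reassignment via cycle consistency.

    cap_emb: (C, d) L2-normalized caption embeddings
    img_emb: (I, d) L2-normalized image embeddings from the synthetic image pool
    Returns: for each caption, the index of the selected image.
    """
    sim = cap_emb @ img_emb.T  # (C, I) cosine similarities
    best = np.empty(len(cap_emb), dtype=int)
    for c in range(len(cap_emb)):
        # Step 1: text-to-image retrieval -- top-k candidate images for caption c.
        cands = np.argsort(sim[c])[::-1][:k]
        scores = []
        for i in cands:
            # Step 2: cycle consistency -- from candidate image i, retrieve captions
            # and check how highly the original caption c ranks.
            cap_ranks = np.argsort(img_emb[i] @ cap_emb.T)[::-1]
            rank_of_c = int(np.where(cap_ranks == c)[0][0])
            scores.append(-rank_of_c)  # smaller rank (closer to top) -> higher score
        best[c] = cands[int(np.argmax(scores))]
    return best
```

A caption keeps a candidate image only if image-to-text retrieval from that image ranks the original caption highly, which filters out candidates that are merely similar in one direction.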

Takeaways, Limitations

Takeaways:
We present SynC, a novel data-cleaning framework that effectively addresses the semantic misalignment problem of synthetic data.
Unlike conventional filtering or regeneration approaches, SynC improves data quality by reassigning captions to the best-matching images within the existing image pool.
SynC's effectiveness is demonstrated by consistent performance gains and state-of-the-art results across a variety of ZIC models and benchmarks.
It opens new possibilities for using synthetic data in zero-shot image captioning.
Limitations:
SynC's performance gains may be limited to the evaluated benchmarks and models; validation of generalization to other datasets and models is needed.
The cycle-consistency-based alignment score may not always select the truly optimal image; more sophisticated alignment techniques may be required.
Due to limitations of the T2I model itself, the quality of the generated images may still constrain SynC's performance; higher-quality image generation models may be needed.