Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluating Compositional Generalization in VLMs and Diffusion Models

Created by
  • Haebom

Authors

Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis

Outline

This paper evaluates Vision-Language Models (VLMs) on a fundamental aspect of natural language semantics: the ability to form new meanings by composing known parts. VLMs such as CLIP tend to represent images in a "bag-of-words" manner and fail to capture compositional meaning. The authors investigate whether a generative, diffusion-model-based classifier (the Diffusion Classifier) can overcome this limitation. They evaluate three models (the Diffusion Classifier, CLIP, and ViLT) on their ability to bind objects with attributes and relations in zero-shot learning (ZSL) and generalized zero-shot learning (GZSL) settings. The results show that the Diffusion Classifier and ViLT perform well on concept binding tasks, but all models struggle with the relational GZSL task, highlighting the difficulty VLMs have with relational reasoning. Analysis of the CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as "left" and "right."
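
To make the "bag-of-words" failure mode concrete, here is a minimal toy sketch in Python. It is not CLIP's encoder and not the authors' evaluation code: the `bag_of_words` and `cosine` helpers and the example captions are assumptions made up for illustration. The point is only that an encoder which ignores word order cannot distinguish captions that differ in which attribute is bound to which object, and that captions differing in a single relation word remain very close.

```python
import math
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Toy encoder (hypothetical): count words and discard word order."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Same words, different attribute-object binding.
swapped = cosine(
    bag_of_words("a red cube and a blue cylinder"),
    bag_of_words("a blue cube and a red cylinder"),
)

# Captions that differ only in the relation word.
relation = cosine(
    bag_of_words("the cube is left of the sphere"),
    bag_of_words("the cube is right of the sphere"),
)

print(f"swapped binding: {swapped:.3f}")
print(f"left vs right:   {relation:.3f}")
```

In this toy setting the two swapped-binding captions are indistinguishable, and the "left" and "right" captions differ in only one dimension. Real VLM text encoders are order-sensitive in principle, so this is an exaggerated picture of the tendency the paper describes, not a measurement of it.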

Takeaways, Limitations

Takeaways: The results suggest that diffusion-model-based classifiers may have better compositional generalization ability than conventional VLMs such as CLIP. The strong performance of the Diffusion Classifier and ViLT on concept binding tasks is particularly noteworthy.
Limitations: All models struggle significantly with the relational GZSL task, which points to the need for further research into the relational reasoning capabilities of VLMs. The CLIP embedding analysis offers some clues about the cause, but further analysis is needed: beyond the similarity of relational concept representations, other factors may contribute to the poor relational reasoning performance of VLMs.