Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Created by
  • Haebom

Authors

Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan

Outline

This paper probes the limitations of state-of-the-art vision-language models (VLMs) by examining their ability to solve rebus puzzles: visual word puzzles that encode language through images, spatial arrangements, and symbolic substitutions. Solving them requires multimodal abstraction, symbolic reasoning, and an understanding of cultural, phonetic, and linguistic wordplay. The research team built a manually generated and annotated benchmark of diverse English rebus puzzles and analyzed the performance of modern VLMs on it. The puzzles range in difficulty from simple image substitutions to those that hinge on spatially dependent cues.
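For readers curious what such an evaluation looks like in practice, below is a minimal sketch of how a benchmark of this kind might be scored. The `rebus_benchmark.json` annotation file and the `query_vlm` wrapper are hypothetical placeholders (not the authors' released code), and exact-match scoring after normalization is just one plausible metric; the paper's actual protocol may differ.

```python
import json
import re

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so 'Top Secret!' matches 'top secret'."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def evaluate(puzzles, query_vlm):
    """Score a VLM on rebus puzzles, grouped by difficulty tier.

    puzzles:   iterable of dicts with 'image_path', 'solution', 'difficulty'
    query_vlm: callable(image_path, prompt) -> answer string; hypothetical,
               wrap whichever VLM API you actually use.
    """
    prompt = ("This image is a rebus puzzle. "
              "What common English word or phrase does it encode?")
    by_tier = {}
    for p in puzzles:
        answer = query_vlm(p["image_path"], prompt)
        by_tier.setdefault(p["difficulty"], []).append(
            normalize(answer) == normalize(p["solution"]))
    # Accuracy per tier, e.g. simple image substitution vs. spatial-cue puzzles.
    return {tier: sum(hits) / len(hits) for tier, hits in by_tier.items()}

if __name__ == "__main__":
    with open("rebus_benchmark.json") as f:   # hypothetical annotation file
        puzzles = json.load(f)
    # Stub "model" that always answers the same phrase, just to exercise the loop.
    print(evaluate(puzzles, query_vlm=lambda img, prompt: "top secret"))
```

Reporting accuracy per difficulty tier, rather than a single aggregate number, is what lets an analysis like this separate simple cue decoding from the abstract-reasoning failures discussed below.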

Takeaways and Limitations

Takeaways: VLMs show strong ability to decode simple, direct visual cues, but struggle with tasks that require abstract reasoning, lateral thinking, or an understanding of visual metaphors. The Rebus Puzzle Benchmark can serve as a useful tool for evaluating and improving VLM performance.
Limitations: The manually generated benchmark is limited in scale, and the rebus puzzle dataset needs to be expanded to cover diverse linguistic and cultural backgrounds. Further research is needed to improve VLMs' abstract reasoning and visual-metaphor understanding.