Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

Created by
  • Haebom

Authors

Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang

Outline

CrossWordBench is a novel benchmark that evaluates reasoning ability through the interaction of text-based clues and a visual grid structure. It uses crossword puzzles to test both large language models (LLMs) and large vision-language models (LVLMs), providing puzzles in both text and image formats and allowing difficulty to be varied by adjusting how densely the grid is filled with dictionary words. Evaluations of more than 20 models show that reasoning-oriented LLMs significantly outperform non-reasoning models at solving crossword puzzles, and that LVLMs' puzzle-solving performance correlates strongly with their grid-parsing accuracy. The study highlights the limits of current LLMs' and LVLMs' reasoning abilities and presents an effective way to generate multimodal, constraint-based tasks for future evaluations.
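To make the "controllable puzzle generation" idea concrete, below is a minimal sketch of how a crossword-style grid could be generated with a tunable fill ratio as a difficulty knob. The word list, function names, and placement strategy are illustrative assumptions for this page, not the authors' actual CrossWordBench pipeline.

```python
import random

# Minimal sketch of controllable crossword-style puzzle generation.
# The word list, fill-ratio knob, and placement strategy are illustrative
# assumptions, not the authors' actual CrossWordBench pipeline.
WORDS = ["apple", "grape", "lemon", "pear", "plum", "melon", "berry"]

def generate_grid(size=10, target_fill=0.3, seed=0):
    """Place words across or down until roughly `target_fill` of the
    cells are non-empty (a stand-in for controlling difficulty)."""
    rng = random.Random(seed)
    grid = [["." for _ in range(size)] for _ in range(size)]
    placed = []

    def fill_ratio():
        return sum(c != "." for row in grid for c in row) / (size * size)

    attempts = 0
    while fill_ratio() < target_fill and attempts < 500:
        attempts += 1
        word = rng.choice(WORDS)
        horizontal = rng.random() < 0.5
        r, c = rng.randrange(size), rng.randrange(size)
        cells = [(r, c + i) if horizontal else (r + i, c) for i in range(len(word))]
        if any(cr >= size or cc >= size for cr, cc in cells):
            continue
        # Allow overlaps only where letters match (crossword-style crossings).
        if any(grid[cr][cc] not in (".", ch) for (cr, cc), ch in zip(cells, word)):
            continue
        for (cr, cc), ch in zip(cells, word):
            grid[cr][cc] = ch
        placed.append((word, r, c, "across" if horizontal else "down"))
    return grid, placed

if __name__ == "__main__":
    grid, clues = generate_grid(target_fill=0.35)
    print("\n".join(" ".join(row) for row in grid))
    for word, r, c, direction in clues:
        print(f"{direction} at ({r},{c}): clue for '{word}'")
```

The text rendering above corresponds to the benchmark's text format; an image format would additionally render the same grid as a picture so that an LVLM must parse the grid visually before solving it.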

Takeaways, Limitations

Takeaways:
Presents a new benchmark for evaluating multimodal reasoning that considers the interaction between text and images.
Identifies a strong correlation between LVLMs' grid-parsing accuracy and their puzzle-solving performance.
Provides a flexible benchmark framework with a range of difficulty levels and evaluation methods.
Clearly demonstrates the limitations of current LLMs' and LVLMs' reasoning capabilities.
Limitations:
Evaluation is limited to a single task type (crossword puzzles).
Lacks an in-depth analysis of the causes of LVLMs' performance degradation.
Further research is needed on the benchmark's generalizability.