Daily Arxiv

This page collects papers on artificial intelligence published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games

Created by
  • Haebom

Author

Jingcong Liang, Shijun Wan, Xuehai Wu, Yitong Li, Qianglong Chen, Duyu Tang, Siyuan Wang, Zhongyu Wei

Outline

This paper introduces HardcoreLogic, a benchmark designed to evaluate whether large reasoning models (LRMs) can flexibly apply rules under varied conditions, particularly when faced with non-standard game variants. To overcome the limitations of existing benchmarks, HardcoreLogic comprises over 5,000 puzzles across ten games, created by modifying canonical puzzles along three dimensions: increased complexity (IC), unusual elements (UE), and unsolvable puzzles (UP). Experiments show that even models scoring highly on existing benchmarks degrade significantly on HardcoreLogic, suggesting a reliance on memorized solution patterns.
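To make the per-dimension evaluation concrete, here is a minimal sketch of how accuracy might be broken down across the three variation types. The record format, dimension tags, and function are hypothetical illustrations, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical per-item results: each benchmark puzzle is tagged with the
# variation dimension it belongs to (IC / UE / UP, as described above).
results = [
    {"dimension": "IC", "correct": True},   # increased complexity
    {"dimension": "IC", "correct": False},
    {"dimension": "UE", "correct": False},  # unusual elements
    {"dimension": "UP", "correct": False},  # unsolvable puzzles
]

def accuracy_by_dimension(results):
    """Group model answers by variation dimension and compute accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += r["correct"]
    return {d: hits[d] / totals[d] for d in totals}

print(accuracy_by_dimension(results))
```

A breakdown like this is what surfaces the gap the paper reports: aggregate accuracy can look acceptable while the UE and UP slices collapse.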

Takeaways, Limitations

Takeaways:
HardcoreLogic provides a new benchmark for evaluating the higher-order logical reasoning capabilities of LRMs.
High performance on existing benchmarks may not reflect genuine reasoning ability.
The benchmark tests model robustness by introducing increased complexity, unusual elements, and unsolvable puzzle variants.
Models struggle to adapt to subtle rule changes.
Error analysis reveals gaps in the genuine reasoning ability of LRMs.
Limitations:
HardcoreLogic focuses on evaluating LRM performance and does not provide specific methodologies or guidelines for improving models.
Further research is needed to determine whether the games and variations in the benchmark cover all types of logical reasoning ability.
Further research is needed to analyze in depth the causes of the observed performance degradation.