Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures

Created by
  • Haebom

Authors

Yu He, Yingxi Li, Colin White, Ellen Vitercik

Outline

To evaluate the algorithmic reasoning capabilities of large language models (LLMs), the authors focus on structural reasoning: the ability to understand and manipulate relationships such as order, hierarchy, and connectivity. DSR-Bench is presented as the first benchmark to systematically evaluate LLM structural reasoning using standard data structures, which provide interpretable and algorithmically meaningful abstractions. It includes 20 data structures, 35 operations, and 4,140 synthetically generated problem instances. Its hierarchical design allows specific failure modes to be pinpointed, while fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs showed that even the best-performing model scored only 0.498 on difficult instances, exposing significant limitations. Further evaluation revealed weaknesses in reasoning about spatial data, natural language scenarios, and self-generated code. DSR-Bench provides a principled diagnostic tool for structural reasoning, identifying reasoning bottlenecks and supporting the development of more robust and reliable LLMs.
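As a rough illustration of what a synthetically generated, automatically graded data-structure task could look like, here is a minimal Python sketch for a queue. It is not taken from DSR-Bench: the function names `make_queue_instance` and `score_answer`, the prompt wording, and the exact-match scoring rule are all assumptions made for this example.

```python
import random
from collections import deque


def make_queue_instance(num_ops, seed):
    """Build one hypothetical task: a prompt plus its ground-truth answer.

    The operation mix and prompt wording are illustrative assumptions,
    not the format used by DSR-Bench.
    """
    rng = random.Random(seed)  # seeded so the instance is reproducible
    queue = deque()
    ops = []
    for _ in range(num_ops):
        # Only dequeue when the queue is non-empty, so every op is valid.
        if queue and rng.random() < 0.4:
            ops.append(("dequeue", None))
            queue.popleft()
        else:
            value = rng.randint(0, 99)
            ops.append(("enqueue", value))
            queue.append(value)

    lines = ["Start with an empty queue and apply these operations in order:"]
    for name, value in ops:
        lines.append(name if value is None else f"{name} {value}")
    lines.append("Report the final queue from front to back, comma-separated.")
    return "\n".join(lines), list(queue)


def score_answer(answer_text, expected):
    """Exact-match scoring: True only if the parsed list equals the truth."""
    try:
        parsed = [int(tok) for tok in answer_text.split(",") if tok.strip()]
    except ValueError:
        return False  # anything that is not a list of integers is wrong
    return parsed == expected


if __name__ == "__main__":
    prompt, expected = make_queue_instance(num_ops=8, seed=0)
    print(prompt)
    print("ground truth:", expected)
```

Because the ground truth comes from actually running the operations, grading needs no human judgment; a model's reply is passed to `score_answer` and either matches the final state or does not.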

Takeaways, Limitations

Takeaways:
Presents DSR-Bench, a new benchmark for evaluating LLMs' structural reasoning capabilities.
Systematically evaluates LLMs' reasoning ability across a variety of data structures and operations.
Concretely demonstrates the limitations of cutting-edge LLMs' structural reasoning capabilities.
Contributes to identifying reasoning bottlenecks in LLMs and suggesting ways to improve them.
Limitations:
The evaluation scope is limited to specific data structures and operations.
Other types of reasoning abilities may exist that DSR-Bench does not cover.
Evaluation results may depend on the specific LLM architecture or training data.