Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

SLR: Automated Synthesis for Scalable Logical Reasoning

Created by
  • Haebom

Authors

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

Outline

This paper presents Scalable Logical Reasoning (SLR), an end-to-end framework for the systematic evaluation and training of large language models (LLMs). From a user's task specification, SLR automatically generates (i) instruction prompts for inductive reasoning tasks, (ii) executable validation programs that provide verifiable rewards for model outputs, and (iii) the corresponding ground-truth rules. The process is fully automated and scalable, requires no human annotation, and allows precise control over task difficulty. Using SLR, the authors build SLR-Bench, a benchmark of 19,000 prompts organized into 20 curriculum levels of increasing relational, arithmetic, and recursive complexity. Large-scale evaluations show that state-of-the-art LLMs readily produce syntactically valid rules but often fail at accurate logical reasoning. Recent reasoning LLMs perform better but incur very high test-time compute, exceeding $300 for 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B's SLR-Bench accuracy, reaching a level comparable to Gemini-Flash-Thinking at a far lower computational cost. Moreover, the acquired reasoning ability generalizes to a range of existing benchmarks, highlighting the effectiveness of SLR training for downstream reasoning.
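To make the generate-prompt / verify-output loop concrete, here is a minimal, hypothetical sketch. The function names, data types, and the toy arithmetic rule are assumptions for illustration only, not the actual SLR API (the framework itself works with logic-programming tasks checked by a symbolic judge); the point is the shape of the pipeline: synthesize a hidden ground-truth rule, sample labeled examples into an instruction prompt, and score a candidate rule with an executable verifier that can serve as a verifiable reward.

```python
# Toy sketch of an SLR-style synthesis + verification loop (hypothetical, not the real SLR API).
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[int, int]          # toy "objects": pairs of integers
Rule = Callable[[Example], bool]   # a rule classifies an example as positive/negative

@dataclass
class Task:
    prompt: str                              # (i) instruction prompt
    examples: List[Tuple[Example, bool]]     # labeled examples shown to the model
    ground_truth: Rule                       # (iii) hidden ground-truth rule

def synthesize_task(level: int, seed: int = 0) -> Task:
    """Generate one inductive-reasoning task; `level` loosely controls difficulty."""
    rng = random.Random(seed)
    threshold = level  # harder levels -> larger constants in the hidden rule
    ground_truth: Rule = lambda ex: ex[0] + ex[1] > threshold
    examples = []
    for _ in range(8):
        ex = (rng.randint(0, 2 * level), rng.randint(0, 2 * level))
        examples.append((ex, ground_truth(ex)))
    prompt = (
        "Induce a rule that separates positive from negative examples.\n"
        + "\n".join(f"{ex} -> {'pos' if label else 'neg'}" for ex, label in examples)
    )
    return Task(prompt, examples, ground_truth)

def verify(candidate: Rule, task: Task) -> float:
    """(ii) Executable verifier: fraction of labeled examples the candidate rule reproduces."""
    correct = sum(candidate(ex) == label for ex, label in task.examples)
    return correct / len(task.examples)  # usable as a verifiable reward during training

if __name__ == "__main__":
    task = synthesize_task(level=3)
    print(task.prompt)
    # A model-proposed rule would be parsed and executed here; one is hard-coded for illustration.
    print("reward:", verify(lambda ex: ex[0] + ex[1] > 3, task))
```

In the paper's setting, tasks and rules are expressed as logic programs and checked symbolically; plain Python callables are used here only to keep the sketch short.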

Takeaways, Limitations

Takeaways:
SLR is an efficient, scalable framework for assessing and improving logical reasoning in LLMs.
The pipeline automatically generates prompts, verification programs, and ground-truth rules without human annotation.
Curriculum learning on SLR-generated tasks substantially improves LLM reasoning skills (a toy training-loop sketch appears at the end of this summary).
SLR-Bench provides a new large-scale benchmark for objectively evaluating the reasoning capabilities of LLMs.
SLR-trained models reach performance comparable to top-performing reasoning models at much lower cost.
The improved reasoning ability generalizes across a variety of existing benchmarks.
Limitations:
SLR-Bench currently focuses on a specific class of logical reasoning problems; further work is needed to extend it to other types of reasoning tasks.
The high test-time compute cost of high-performance reasoning LLMs remains an open issue.
SLR's effectiveness may depend on the specific LLM architecture, so experiments across additional architectures are needed.
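As a rough illustration of the curriculum-learning takeaway, the hypothetical loop below reuses the toy `synthesize_task` and `verify` helpers sketched above; `model_step` is a stand-in for a model that proposes a rule and updates on the reward. This is an assumption-laden sketch, not the training recipe used in the paper.

```python
# Hypothetical curriculum loop over difficulty levels with verifiable rewards (not the paper's recipe).
def curriculum_train(model_step, levels=range(1, 21), tasks_per_level=50, pass_bar=0.8):
    """`model_step(prompt) -> Rule` stands in for a model proposal plus a learning update."""
    for level in levels:
        scores = []
        for i in range(tasks_per_level):
            task = synthesize_task(level, seed=i)
            candidate = model_step(task.prompt)      # model proposes a rule, learns from the reward
            scores.append(verify(candidate, task))   # verifiable reward from the executable verifier
        if sum(scores) / len(scores) < pass_bar:
            break  # do not advance until the current level is mastered

# Example call with a dummy "model" that always guesses the same rule:
# curriculum_train(lambda prompt: (lambda ex: ex[0] + ex[1] > 3))
```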