Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs

Created by
  • Haebom

Author

Jay Vaghasiya, Omkar Ghugarkar, Vishvesh Bhat, Vipul Dholaria, Julian McAuley

Outline

CoreThink is a state-of-the-art reasoning layer built on a novel reasoning method called General Symbolics, which differs from existing paradigms such as test-time scaling, supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). The CoreThink General Symbolic Reasoner (GSR) is structured around three key use cases, tool calling, code generation, and planning, and demonstrates strong performance across a total of seven benchmarks in these areas. Specifically, it achieves state-of-the-art (SOTA) scores of 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, and 24.4% on ARC-AGI-2. The authors also present an agentic coding IDE built on the principles of General Symbolics, which reaches a state-of-the-art accuracy of 62.3% on SWE-Bench Lite. These gains are achieved without any fine-tuning or training costs: the CoreThink reasoning layer is designed to deliver a pure performance uplift, ensuring that accuracy on a model's reasoning tasks never degrades. The authors argue that existing methods will eventually hit diminishing returns in LLM performance, making new reasoning techniques necessary. The technical report describes the CoreThink approach at a high level and announces the availability of CoreThink models for reasoning-intensive use cases.
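
Because the report stays at a high level, the sketch below is only an illustration of what a model-agnostic reasoning layer wrapped around an unmodified base LLM might look like for the three use cases above. It is not the CoreThink/GSR implementation, and every name in it (ReasoningLayer, Tool, call_llm) is hypothetical; the point is the architectural idea of adding planning, tool routing, and result checking around an off-the-shelf model without any training.

```python
# Illustrative sketch only -- NOT the actual CoreThink/GSR implementation.
# All names (ReasoningLayer, Tool, call_llm) are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # takes a step description, returns a result string


def call_llm(prompt: str) -> str:
    """Placeholder for any off-the-shelf LLM API; the base model is never trained."""
    raise NotImplementedError("plug in your own LLM client here")


class ReasoningLayer:
    """Minimal wrapper that decomposes a long-horizon task into explicit steps,
    routes each step to a tool or to generation, and checks intermediate results
    before continuing -- the general shape of a reasoning layer over a base LLM."""

    def __init__(self, tools: List[Tool]):
        self.tools: Dict[str, Tool] = {t.name: t for t in tools}

    def plan(self, task: str) -> List[str]:
        # Ask the base model for an explicit, numbered plan instead of a one-shot answer.
        raw = call_llm(f"Break the task into short numbered steps:\n{task}")
        return [line.strip() for line in raw.splitlines() if line.strip()]

    def execute_step(self, step: str) -> str:
        # Route the step: use a registered tool if one matches, otherwise generate directly.
        for tool in self.tools.values():
            if tool.name.lower() in step.lower():
                return tool.run(step)
        return call_llm(f"Carry out this step and return only the result:\n{step}")

    def solve(self, task: str) -> str:
        results = []
        for step in self.plan(task):
            out = self.execute_step(step)
            # Verification hook: re-check each intermediate result so errors
            # do not silently propagate across a long horizon.
            check = call_llm(f"Step: {step}\nResult: {out}\nIs this result consistent? yes/no")
            if check.strip().lower().startswith("no"):
                out = self.execute_step(step)  # single retry as a stand-in for richer repair
            results.append(out)
        return call_llm(f"Task: {task}\nStep results:\n" + "\n".join(results) + "\nFinal answer:")
```

The design choice being illustrated is that all gains come from orchestration around the base model (planning, routing, verification) rather than from fine-tuning or additional training, which is the property the paper claims for its reasoning layer.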

Takeaways, Limitations

Takeaways:
Introducing CoreThink, a new reasoning layer built on the General Symbolics method.
Demonstrating strong performance compared to existing approaches (SFT, RLVR, etc.), with SOTA results on multiple benchmarks including Livecodebench v6, Instruction-Following Evals, ARC-AGI-2, and SWE-Bench Lite.
Achieving these performance gains without any fine-tuning or training costs.
Opening CoreThink models for reasoning-intensive use cases.
Suggesting that new reasoning techniques are needed as existing methods approach diminishing returns in LLM performance.
Limitations:
The General Symbolics method is described only at a high level; its specific details are not provided in the paper.
Although performance is strong across most benchmarks, absolute scores remain low on some (e.g., 24.4% on ARC-AGI-2).
Further research is needed on the generalization and scalability of General Symbolics.
CoreThink's practical applicability and limitations require further validation.