Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Are Large Language Models Capable of Deep Relational Reasoning? Insights from DeepSeek-R1 and Benchmark Comparisons

Created by
  • Haebom

Author

Chi Chiu So, Yueyue Sun, Jun-Min Wang, Siu Pang Yung, Anthony Wai Keung Loh, Chun Pong Chau

Outline

In this paper, we evaluate and compare the deep relational reasoning capabilities of three state-of-the-art large language models (LLMs), DeepSeek-R1, DeepSeek-V3, and GPT-4o, on benchmark tasks of genealogical and general graph reasoning. Experimental results show that DeepSeek-R1 achieves the highest F1 scores across tasks and problem sizes, demonstrating its strength in logical and relational reasoning. However, as problem complexity increases, all evaluated models (including DeepSeek-R1) struggle significantly due to token length constraints and incomplete output structures. A detailed analysis of DeepSeek-R1's long Chain-of-Thought responses reveals distinctive planning and verification strategies, but also exposes instances of inconsistent or incomplete reasoning, underscoring the need for a deeper investigation into the internal reasoning dynamics of LLMs. The paper discusses key directions for future research, including the role of multimodal reasoning and the systematic examination of reasoning failures, and provides experimental insights and theoretical implications for improving the reasoning capabilities of LLMs on tasks that require structured, multi-step logical inference. The code repository is publicly available at https://github.com/kelvinhkcs/Deep-Relational-Reasoning.
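
The paper reports F1 as its primary metric across tasks and problem sizes. As a rough illustration only (not the authors' evaluation code), the sketch below shows one way precision, recall, and F1 could be computed over predicted relation triples against a gold set for a genealogy-style task; the triple format and the `relation_f1` helper are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's code): scoring predicted relation
# triples against a gold set with precision, recall, and F1.

def relation_f1(gold: set, predicted: set) -> dict:
    """Compute precision, recall, and F1 over relation triples.

    Each triple is assumed to look like (head, relation, tail),
    e.g. ("Alice", "grandmother_of", "Carol").
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy family-tree example: two of three gold relations recovered,
    # plus one spurious prediction.
    gold = {("Alice", "mother_of", "Bob"),
            ("Bob", "father_of", "Carol"),
            ("Alice", "grandmother_of", "Carol")}
    predicted = {("Alice", "mother_of", "Bob"),
                 ("Alice", "grandmother_of", "Carol"),
                 ("Bob", "brother_of", "Carol")}
    print(relation_f1(gold, predicted))  # precision ≈ 0.67, recall ≈ 0.67, f1 ≈ 0.67
```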

Takeaways, Limitations

Takeaways: Experimentally demonstrates that DeepSeek-R1 performs well on complex relational reasoning tasks. Advances understanding of the internal reasoning processes of LLMs. Highlights the importance of multimodal reasoning and the systematic analysis of reasoning failures.
Limitations: The performance of all models deteriorates as problem complexity increases, with token length limits and incomplete output structures identified as the main causes. Cases of inconsistent or incomplete reasoning are observed in the models.