Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Investigating the Robustness of Deductive Reasoning with Large Language Models

Created by
  • Haebom

Authors

Fabian Hoppe, Filip Ilievski, Jan-Christoph Kalo

Outline

This paper presents the first systematic investigation of the robustness of deductive reasoning in large language models (LLMs). To evaluate the robustness of formal and informal LLM-based reasoning methods, we propose a framework that generates seven perturbed datasets using two types of perturbation: adversarial noise and counterfactual statements. We categorize LLM reasoners by their reasoning format, formalization syntax, and use of feedback for error recovery, and analyze the strengths and weaknesses of each approach. Experimental results show that adversarial noise mainly affects automatic formalization, whereas counterfactual statements degrade all approaches. Detailed feedback reduces syntactic errors but does not improve overall accuracy, highlighting how difficult effective self-correction remains for LLM-based methods.
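To make the two perturbation types concrete, the sketch below shows how perturbed variants of a deductive reasoning example could be generated. This is a minimal, hypothetical illustration, not the authors' framework: the function names, distractor sentences, and premises are assumptions introduced only for this example.

```python
# Illustrative sketch only (not the paper's code): generate perturbed variants
# of a deductive reasoning problem. `add_adversarial_noise` and
# `make_counterfactual` are hypothetical helper names.
import random

# Hypothetical irrelevant statements used as adversarial noise.
DISTRACTORS = [
    "All birds that can swim are blue.",
    "Some chairs are older than the moon.",
]

def add_adversarial_noise(premises: list[str], rng: random.Random) -> list[str]:
    """Insert an irrelevant distractor statement among the original premises."""
    noisy = list(premises)
    noisy.insert(rng.randrange(len(noisy) + 1), rng.choice(DISTRACTORS))
    return noisy

def make_counterfactual(premises: list[str]) -> list[str]:
    """Swap a commonsense-consistent premise for a world-inconsistent one,
    keeping the deductive structure of the problem intact."""
    return ["All mammals can fly."] + premises[1:]

if __name__ == "__main__":
    rng = random.Random(0)
    base = ["All mammals are warm-blooded.", "Whales are mammals."]
    print(add_adversarial_noise(base, rng))  # adds irrelevant noise
    print(make_counterfactual(base))         # contradicts world knowledge
```

Under this reading, adversarial noise tests whether a reasoner can ignore irrelevant premises, while counterfactual statements test whether it reasons from the stated premises rather than from memorized world knowledge.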

Takeaways, Limitations

Takeaways: We present the first systematic study of the robustness of LLM-based reasoning methods, identifying limitations of LLMs' deductive reasoning capabilities and suggesting potential improvements. By revealing their vulnerability to adversarial noise and counterfactual statements, we point to directions for developing more robust LLM-based reasoning systems. We classify and compare a range of LLM reasoning methods, clearly showing the strengths and weaknesses of each.
Limitations: The proposed perturbation framework may not cover all types of errors. Further research is needed on the effectiveness of automatic formalization and detailed feedback. Because the results are limited to specific LLMs and datasets, generalizability to other LLMs and datasets remains to be verified. Concrete suggestions for improving self-correction efficiency are lacking.