Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Created by
  • Haebom

Authors

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

Outline

This paper proposes DivLogicEval, a new benchmark for evaluating the logical reasoning ability of large language models (LLMs). It addresses shortcomings of existing benchmarks, including the entanglement of other reasoning abilities, limited linguistic diversity, and distributions that deviate from an ideal benchmark distribution. DivLogicEval consists of a diverse set of counterintuitively composed sentences, enabling a more reliable evaluation. The paper also proposes a new evaluation metric that mitigates the effects of LLM bias and randomness, and uses it to compare and analyze the logical reasoning abilities of various LLMs.
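
To make the idea of mitigating bias and randomness concrete, the sketch below shows one generic way such an evaluation can be set up: the same multiple-choice logic question is posed under several shuffled option orders and the scores are averaged, which dampens position bias and per-call randomness. This is not the metric defined in the paper; `query_model`, the option labels, and the aggregation scheme are illustrative assumptions.

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns one of the shown option labels."""
    # In practice this would call an actual model API.
    return random.choice(["A", "B", "C", "D"])

def evaluate_question(question: str, options: list[str], correct: str,
                      n_permutations: int = 8, seed: int = 0) -> float:
    """Pose the same question under several shuffled option orders and
    return the fraction of permutations answered correctly."""
    rng = random.Random(seed)
    labels = ["A", "B", "C", "D"]
    hits = 0
    for _ in range(n_permutations):
        shuffled = options[:]
        rng.shuffle(shuffled)
        prompt = question + "\n" + "\n".join(
            f"{lab}. {opt}" for lab, opt in zip(labels, shuffled))
        answer = query_model(prompt)
        # Map the chosen label back to the option text before scoring.
        chosen = dict(zip(labels, shuffled)).get(answer)
        hits += int(chosen == correct)
    return hits / n_permutations
```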

Takeaways, Limitations

Takeaways:
A new benchmark, DivLogicEval, is presented that overcomes the limitations of existing logical reasoning benchmarks.
A new evaluation metric is proposed that accounts for LLM bias and randomness.
Directions for performance improvement are suggested through a comparative analysis of the logical reasoning capabilities of various LLMs.
Limitations:
Further validation of DivLogicEval's generality and scalability is needed.
Further research is needed on the generalizability and validity of the new evaluation metric.
It remains to be verified whether the proposed benchmark comprehensively covers all types of logical reasoning.