Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Created by
  • Haebom

Authors

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

Outline

This paper addresses the shortcomings of existing benchmarks for natural language logical reasoning, a key indicator of the intelligence of large language models (LLMs), and proposes a new benchmark and evaluation metric. Existing benchmarks entangle logical reasoning with other inference skills, hindering accurate evaluation; they also lack linguistic diversity and can yield biased results that deviate from the ideal benchmark distribution. The paper therefore proposes DivLogicEval, a new classical-logic benchmark consisting of diverse natural language statements composed in a counterintuitive manner, and introduces a new evaluation metric that mitigates the influence of bias and randomness in LLMs. Experiments verify the degree of logical reasoning required to answer DivLogicEval's questions and compare the logical reasoning performance of various LLMs.
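As a loose illustration of what bias-mitigated multiple-choice evaluation can look like (the paper's actual metric is not detailed in this summary), the sketch below averages accuracy over all orderings of the answer options, so that a model's preference for a particular answer position does not inflate or deflate its score. The `predict` callable and the question format are hypothetical, not taken from DivLogicEval.

```python
# Hypothetical sketch: scoring multiple-choice logic questions while averaging
# over answer-option orderings to dampen position bias. This is NOT the paper's
# metric; it only illustrates the general idea of bias-mitigated evaluation.
from itertools import permutations
from typing import Callable, Dict, List


def permutation_averaged_accuracy(
    predict: Callable[[str, List[str]], int],  # returns the index of the chosen option
    questions: List[Dict],                     # each: {"prompt": str, "options": [...], "answer": int}
) -> float:
    """Accuracy averaged over every permutation of the answer options."""
    total, count = 0.0, 0
    for q in questions:
        options, gold = q["options"], q["answer"]
        for order in permutations(range(len(options))):
            shuffled = [options[i] for i in order]
            choice = predict(q["prompt"], shuffled)
            # Map the chosen position back to the original option index.
            total += float(order[choice] == gold)
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Toy model that always picks the first option: permutation averaging
    # washes out its position bias (score ~0.33 instead of a lucky 1.0).
    always_first = lambda prompt, options: 0
    demo = [{"prompt": "If P then Q. P. Therefore?",
             "options": ["Q", "not Q", "P and not Q"], "answer": 0}]
    print(permutation_averaged_accuracy(always_first, demo))
```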

Takeaways, Limitations

Takeaways:
• Identifies the problems of existing benchmarks and proposes a new benchmark and evaluation metric for more reliable evaluation.
• Contributes to model performance evaluation by comparing and analyzing the logical reasoning capabilities of various LLMs.
• Proposes a method that improves the accuracy of logical reasoning assessment and reduces biased results.
Limitations:
• Further validation is needed to determine how broad a range of logical reasoning the DivLogicEval benchmark actually covers and how well it generalizes.
• Whether the new evaluation metric fully eliminates all kinds of bias and randomness remains to be verified.
• Further research is needed to determine whether the proposed benchmark and evaluation metric can be applied to other languages and domains.