This paper addresses the shortcomings of existing benchmarks and proposes a new benchmark and evaluation metric for natural language logical reasoning, a key indicator of the intelligence of large language models (LLMs). Existing benchmarks entangle multiple reasoning skills, hindering an accurate evaluation of logical reasoning in isolation. Furthermore, they lack linguistic diversity and can yield biased results that deviate from the ideal benchmark distribution. To address these issues, this paper proposes DivLogicEval, a new classical logic benchmark consisting of natural language sentences composed of diverse statements in a counterintuitive manner. In addition, it introduces a new evaluation metric that mitigates the influence of the bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the logical reasoning performance of various LLMs.