This paper proposes DivLogicEval, a new benchmark for evaluating the logical reasoning ability of large language models (LLMs). We address the challenges faced by existing benchmarks, including the mixing of other inference abilities, a lack of linguistic diversity, and distributions that deviate from those of an ideal benchmark. DivLogicEval is built from a linguistically diverse set of counterintuitive sentences, enabling a more reliable evaluation of logical reasoning. Furthermore, we propose a new evaluation metric that mitigates the effects of bias and randomness inherent in LLMs, and we use it to compare and analyze the logical reasoning abilities of a range of LLMs.