We present a new benchmark for evaluating the causal inference capabilities of large language models (LLMs). To overcome the limitations of existing benchmarks, we extract causal relationships from top economics and finance journals, constructing a set of 40,379 evaluation items spanning five task types across five domains: health, environment, technology, law, and culture. Experiments on eight state-of-the-art LLMs show that even the best-performing model achieves only 57.6% accuracy. Scaling up model size does not lead to improved performance, and even advanced reasoning models struggle to identify basic causal relationships.