Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Created by
  • Haebom

Author

Donggyu Lee, Sungwon Park, Yerin Hwang, Hyoshin Kim, Hyunwoo Oh, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim

Outline

We present a new benchmark for evaluating the causal reasoning capabilities of large language models (LLMs). To overcome the limitations of existing benchmarks, we extract causal relationships validated in top economics and finance journals, constructing a set of 40,379 evaluation items. The set spans five task types across five domains: health, environment, technology, law, and culture. Experiments on eight state-of-the-art LLMs show that even the best-performing model achieves only 57.6% accuracy. Scaling up model size does not improve performance, and even advanced reasoning models struggle to identify basic causal relationships.

Takeaways, Limitations

Takeaways:
Demonstrates that current LLMs lack robust causal reasoning ability.
Emphasizes the need for reliable causal reasoning in high-stakes applications.
Suggests that model size does not guarantee improved performance.
Limitations:
Specific limitations are not provided.