Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks

Created by
  • Haebom

Authors

Junhao Chen, Bowen Wang, Jiuyang Chang, Yuta Nakashima

Outline

REACT is a benchmark designed to rigorously evaluate the reasoning capabilities of large language models (LLMs) on accountable, high-stakes decision-making tasks in healthcare and law. Unlike existing benchmarks that focus on predictive accuracy, REACT emphasizes transparent and interpretable reasoning, requiring models to closely align their logic with expert-derived procedures. To assess how closely LLM reasoning aligns with that of human experts, 511 clinical cases in healthcare and 86 legal cases in law were annotated with detailed expert-derived rationales supporting each step of the reasoning process. These annotations were guided by carefully constructed reasoning graphs that explicitly encode domain-specific reasoning structures and decision criteria derived from domain experts. The reasoning graphs serve both as a standard for expert annotation and as structured guidelines that steer models toward transparent, step-by-step reasoning. To address the scalability limits of manual annotation, the authors developed a semi-automatic annotation pipeline that efficiently generates new graphs from expert-defined reasoning-graph templates, exploring the potential to extend the approach to additional critical domains. Experimental results show that the reasoning graphs significantly improve both the interpretability and the accuracy of LLM reasoning compared to existing baselines, though a significant gap with expert-level reasoning performance remains.
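The paper summarized above does not publish an implementation here, but the graph-guided reasoning idea can be illustrated with a minimal sketch. The names `ReasoningNode`, `ReasoningGraph`, and `build_graph_prompt`, as well as the toy clinical steps, are hypothetical; the sketch only shows how an expert-derived reasoning graph might be serialized into a step-by-step prompt for an LLM.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    """One expert-derived reasoning step with its decision criterion."""
    node_id: str
    question: str   # what the model must decide at this step
    criterion: str  # expert-defined rule for making the decision
    children: list = field(default_factory=list)  # ids of follow-up steps

@dataclass
class ReasoningGraph:
    """A directed graph of reasoning steps, rooted at an entry node."""
    root: str
    nodes: dict  # node_id -> ReasoningNode

    def walk(self):
        """Yield nodes in depth-first order starting from the root."""
        stack, seen = [self.root], set()
        while stack:
            nid = stack.pop()
            if nid in seen:
                continue
            seen.add(nid)
            node = self.nodes[nid]
            yield node
            stack.extend(reversed(node.children))

def build_graph_prompt(graph: ReasoningGraph, case_text: str) -> str:
    """Serialize the graph into a step-by-step prompt for an LLM."""
    steps = [
        f"Step {i} ({n.node_id}): {n.question}\n  Criterion: {n.criterion}"
        for i, n in enumerate(graph.walk(), start=1)
    ]
    return (
        "Reason about the case below by answering each step in order, "
        "citing the criterion you applied.\n\n"
        f"Case:\n{case_text}\n\n" + "\n".join(steps)
    )

# Hypothetical toy example, loosely in the spirit of a clinical graph.
graph = ReasoningGraph(
    root="severity",
    nodes={
        "severity": ReasoningNode(
            "severity", "Is the presentation severe?",
            "Severe if vital signs are unstable.", ["admit"]),
        "admit": ReasoningNode(
            "admit", "Should the patient be admitted?",
            "Admit if severe or if outpatient care is unsafe."),
    },
)
print(build_graph_prompt(graph, "A 64-year-old presents with chest pain."))
```

A full evaluation in the spirit of REACT would additionally score how well each model answer matches the expert rationale annotated for the corresponding node; that scoring logic is omitted here.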

Takeaways, Limitations

Takeaways: The REACT benchmark provides a new way to rigorously evaluate the reasoning capabilities of LLMs in high-stakes decision-making domains such as medicine and law. Reasoning graphs can make the reasoning process of LLMs transparent and interpretable. The semi-automatic annotation pipeline suggests the approach could be extended to additional domains.
Limitations: LLM reasoning performance still falls significantly short of expert-level performance. Generalizability may be limited, since the dataset covers only the medical and legal domains. Reliance on manual annotation remains high, which may make it difficult to build large-scale datasets.