Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

The NordDRG AI Benchmark for Large Language Models

Created by
  • Haebom

Author

Tapio Pitkäranta

Outline

This paper introduces the NordDRG-AI-Benchmark, the first publicly available benchmark for evaluating how well large language models (LLMs) reason about diagnosis-related groups (DRGs), a crucial component of hospital funding. Because trillions of dollars in healthcare spending in OECD countries are channeled through DRG systems, transparency and auditability are essential. The benchmark includes a machine-readable NordDRG definition table, an expert manual, and a change log template. It comprises two suites: a logic benchmark (13 tasks) and a grouper benchmark (13 tasks). The logic benchmark covers code lookups, cross-table reasoning, grouping functions, multilingual terminology, and CC/MCC validation, while the grouper benchmark requires the model to emulate the DRG grouper perfectly. In the reported experiments, GPT-5 Thinking and Opus 4.1 achieved high scores on the logic benchmark, but even GPT-5 Thinking did not emulate the grouper perfectly. The benchmark offers an objective basis for evaluating LLM performance in the field of hospital financing.
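The grouper benchmark's requirement of perfect emulation can be read as exact-match scoring: a case counts only if the model's DRG assignment is identical to the reference. The following is a minimal sketch of that idea, not the paper's evaluation code. The function name `exact_match_score`, the whitespace stripping, and the placeholder labels (which are not real NordDRG codes) are all assumptions made for illustration.

```python
from typing import Sequence


def exact_match_score(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Return the fraction of cases whose predicted label equals the reference exactly.

    Hypothetical helper: surrounding whitespace is stripped before comparison,
    which is an assumption, not something the benchmark is known to do.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return hits / len(expected)


# Placeholder labels for illustration only; these are not real NordDRG codes.
expected = ["DRG_A", "DRG_B", "DRG_C", "DRG_D"]
predicted = ["DRG_A", "DRG_B", "DRG_X", "DRG_D"]

score = exact_match_score(predicted, expected)

# "Perfect emulation" in this sketch means every single case matches.
perfect = score == 1.0
```

Under this reading, a model that gets three of four cases right scores 0.75 and still falls short of perfect emulation, which is the distinction the summary draws between high logic scores and incomplete grouper emulation.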

Takeaways, Limitations

Takeaways:
The paper provides the first public, rule-perfect benchmark for DRG reasoning, giving a baseline for evaluating the applicability of LLMs to healthcare.
The grouper benchmark, which demands perfect emulation of the DRG grouper, allows the practical applicability of LLMs to be evaluated objectively.
Exact-match scoring makes the assessments reproducible and comparable.
The benchmark can contribute to greater transparency and auditability in hospital financing.
Limitations:
Current LLMs struggle to perfectly emulate the full DRG grouper logic.
The benchmark is specific to the NordDRG system and may not be directly applicable to other DRG systems.
Evaluation on a more diverse set of LLMs and a wider range of test cases is still needed.