Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding
Created by
Haebom
Author
Hanyin Wang, Zhenbang Wu, Gururaj Kolar, Hariprasad Korsapati, Brian Bartlett, Bryan Hull, Jimeng Sun
Outline
DRG-Sapphire is a model for automated DRG coding from clinical notes, trained with large-scale reinforcement learning (RL). It is built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards. DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not encountered in earlier mathematical reasoning tasks. It achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, improving explainability.
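Since the rule-based reward is the central RL signal here, a minimal sketch of what such a reward and GRPO's group-relative advantage computation might look like for DRG coding is given below. The <think> tag format, reward values, and three-digit DRG pattern are illustrative assumptions, not the authors' exact rules.

```python
# Hedged sketch of a rule-based reward for GRPO-style DRG coding; not the
# authors' exact implementation. Assumes the model emits its reasoning inside
# <think>...</think> tags followed by a final line "DRG: <code>".
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
DRG_RE = re.compile(r"DRG:\s*([0-9]{3})\s*$")  # illustrative 3-digit MS-DRG pattern

def drg_reward(completion: str, gold_drg: str) -> float:
    """Return a scalar reward for one sampled completion."""
    reward = 0.0
    # Small format reward: the completion contains a reasoning block.
    if THINK_RE.search(completion):
        reward += 0.1
    # Main rule-based reward: the extracted DRG code matches the label.
    match = DRG_RE.search(completion.strip())
    if match and match.group(1) == gold_drg:
        reward += 1.0
    return reward

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize rewards within one prompt's sample group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    gold = "871"  # hypothetical gold DRG for one clinical note
    completions = [
        "<think>Sepsis with MCC documented...</think>\nDRG: 871",
        "<think>Possibly simple pneumonia...</think>\nDRG: 193",
        "DRG: 871",
    ]
    rewards = [drg_reward(c, gold) for c in completions]
    print(rewards)                   # [1.1, 0.1, 1.0]
    print(grpo_advantages(rewards))  # advantages fed to the policy-gradient update
```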
Takeaways, Limitations
• DRG-Sapphire uses RL to automate DRG coding from clinical notes, improving accuracy and explainability.
• RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that the effectiveness of RL is bounded by the domain knowledge encoded in the base model (see the sketch after this list).
• For out-of-distribution (OOD) tasks such as DRG coding, sufficient domain-knowledge infusion prior to RL is needed to achieve robust RL performance.
• For such tasks, scaling SFT may be more effective and computationally efficient than scaling RL alone.
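The scaling observation above can be read as a simple functional form, accuracy ≈ a + b · log(N_SFT). The sketch below fits that form with ordinary least squares; all data points in it are hypothetical placeholders, not numbers from the paper, and only the functional form comes from the summary above.

```python
# Illustrative fit of the reported log-linear relationship between post-RL
# performance and SFT data size: accuracy ≈ a + b * ln(N_sft).
import math

def fit_log_linear(points: list[tuple[int, float]]) -> tuple[float, float]:
    """Least-squares fit of accuracy = a + b * ln(n) over (n, accuracy) pairs."""
    xs = [math.log(n) for n, _ in points]
    ys = [acc for _, acc in points]
    count = len(points)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

if __name__ == "__main__":
    # Hypothetical (SFT examples, post-RL accuracy) pairs, for illustration only.
    points = [(1_000, 0.30), (5_000, 0.42), (20_000, 0.52), (80_000, 0.62)]
    a, b = fit_log_linear(points)
    for n_sft in (10_000, 160_000):
        print(f"N_sft={n_sft:>7}: predicted accuracy ≈ {a + b * math.log(n_sft):.2f}")
```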