Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Created by
Haebom
Author
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, Sean Hendryx
Outline
Rubrics as Rewards (RaR) is an on-policy reinforcement learning method that extends Reinforcement Learning with Verifiable Rewards (RLVR) beyond verifiable domains by using rubric-based feedback in healthcare and science. RaR evaluates several strategies for aggregating rubric feedback into a scalar reward, achieving relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond and outperforming popular LLM-as-judge baselines that rely on Likert-scale rewards. RaR adapts to a variety of evaluation formats and demonstrates robust performance on both rubric-based and multiple-choice tasks.
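As a rough illustration of what "aggregating rubric feedback into a scalar reward" can look like, the sketch below computes a reward as a normalized weighted sum of per-criterion judge verdicts. All names here (`Criterion`, `rubric_reward`, the toy judge) are hypothetical, not the paper's implementation; RaR's actual aggregation strategies may differ.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str  # e.g. "Mentions relevant drug interactions"
    weight: float     # importance of this rubric item

def rubric_reward(
    response: str,
    rubric: list[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Aggregate binary per-criterion verdicts into a reward in [0, 1]
    via a normalized weighted sum. `judge` stands in for an LLM-judge
    call that decides whether `response` satisfies a criterion."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in rubric if judge(response, c))
    return earned / total

# Toy usage with a trivial keyword check standing in for an LLM judge:
rubric = [
    Criterion("mentions dosage", 2.0),
    Criterion("warns about interactions", 1.0),
]
toy_judge = lambda resp, c: c.description.split()[-1] in resp
print(rubric_reward("typical dosage is 50 mg; check interactions", rubric, toy_judge))
```

In an on-policy training loop, this scalar would replace the verifiable reward (e.g., exact-match correctness) used in standard RLVR.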
Takeaways, Limitations
• Introduces RaR, an on-policy reinforcement learning methodology that extends RLVR beyond verifiable domains by leveraging rubric-based feedback.
• Experimental results on HealthBench (medical) and GPQA-Diamond (scientific) show that RaR outperforms LLM-as-judge baselines.
• Adapts to a variety of evaluation formats and reduces performance variance across judge model sizes, allowing even small judge models to provide reliable reward signals.