Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Created by
  • Haebom

Author

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, Sean Hendryx

Outline

Rubrics as Rewards (RaR) is an on-policy reinforcement learning method that extends Reinforcement Learning with Verifiable Rewards (RLVR) beyond verifiable domains by using rubric-based feedback in fields such as medicine and science. The authors evaluate several strategies for aggregating rubric feedback into rewards, achieving relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond and outperforming the popular LLM-as-judge baseline, which relies on Likert-based rewards. RaR adapts to a variety of evaluation formats and demonstrates robust performance on both rubric-based and multiple-choice tasks.
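
As a rough illustration of the core idea, here is a minimal Python sketch of how per-criterion rubric feedback might be aggregated into a scalar reward for on-policy RL. The paper evaluates several aggregation strategies; the normalized weighted sum, the RubricItem structure, and the judge callback below are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: rubric-based reward aggregation (illustrative only).
# The aggregation scheme (a normalized weighted sum) is an assumption;
# the paper compares several strategies.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g., "Mentions relevant contraindications"
    weight: float    # importance assigned to this criterion

def rubric_reward(
    response: str,
    rubric: List[RubricItem],
    judge: Callable[[str, str], bool],  # judge(response, criterion) -> satisfied?
) -> float:
    """Aggregate per-criterion judge verdicts into a scalar reward in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if judge(response, item.criterion))
    return earned / total

# Usage: the scalar reward can stand in for a verifiable 0/1 checker in
# any on-policy RL loop (e.g., PPO- or GRPO-style training).
rubric = [
    RubricItem("States the correct first-line treatment", weight=2.0),
    RubricItem("Notes relevant contraindications", weight=1.0),
]
reward = rubric_reward(
    "...model response...",
    rubric,
    judge=lambda resp, crit: True,  # in practice, an LLM-as-judge call
)
```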

Takeaways, Limitations

  • Introduces RaR, an on-policy reinforcement learning methodology that extends RLVR beyond verifiable domains by leveraging rubric-based feedback.
  • Experimental results on HealthBench and GPQA-Diamond, in the medical and scientific domains, show RaR outperforming the LLM-as-judge baseline.
  • Adapts to various evaluation formats and reduces performance variance across judge model sizes, allowing even small judge models to provide reliable reward signals.
  • Limitations are not specified in the paper.