Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

JudgeLRM: Large Reasoning Models as a Judge

Created by
  • Haebom

Author

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He

Outline

This paper studies large language models (LLMs) as evaluators (LLM-as-a-Judge). Existing supervised fine-tuning (SFT) approaches fall short on judgment tasks that demand complex reasoning, so the authors investigate whether judge models genuinely benefit from stronger reasoning ability. Their analysis reveals a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples in a task. To overcome this limitation, they propose JudgeLRM, a family of judgment-oriented LLMs trained with reinforcement learning (RL) using judge-wise, outcome-driven rewards. JudgeLRM models outperform both SFT-tuned baselines and state-of-the-art reasoning models, particularly on judgment tasks requiring deep reasoning: JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score.
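To make the judge-wise reward idea concrete, here is a minimal toy sketch of how such a reward might be computed for a pairwise judgment. All names, the 1-10 score scale, and the 0.5/0.5 weighting are illustrative assumptions, not the paper's exact reward definition.

```python
def judge_reward(pred_scores, gold_scores):
    """Toy judge-wise reward for scoring a pair of candidate answers.

    pred_scores: (p1, p2) scores the judge model assigned to answers A and B.
    gold_scores: (g1, g2) reference scores from annotation.
    (Hypothetical sketch; not the paper's implementation.)
    """
    p1, p2 = pred_scores
    g1, g2 = gold_scores

    # Relation term: did the judge rank the pair the same way as the reference?
    pred_pref = (p1 > p2) - (p1 < p2)   # +1, 0, or -1
    gold_pref = (g1 > g2) - (g1 < g2)
    relation = 1.0 if pred_pref == gold_pref else 0.0

    # Absolute term: closeness of the raw scores, scaled to [0, 1]
    # under the assumption that scores lie on a 1-10 scale.
    absolute = 1.0 - (abs(p1 - g1) + abs(p2 - g2)) / 18.0

    # Weighted combination; the equal split is an assumption.
    return 0.5 * relation + 0.5 * absolute
```

In an RL loop, a scalar like this would be attached to each generated judgment and used to update the policy, rewarding outcomes (correct rankings and calibrated scores) rather than imitation of reference rationales.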

Takeaways, Limitations

Takeaways:
We propose JudgeLRM, a new RL-based approach to using LLMs as evaluators, overcoming the limitations of existing SFT methods.
Reinforcement learning improves both the reasoning ability and the evaluation performance of judge models.
JudgeLRM outperforms strong existing models on evaluation tasks that require complex reasoning.
The results suggest a path toward more scalable and efficient LLM-based evaluation systems.
Limitations:
The performance gains of JudgeLRM may be limited to specific datasets or tasks.
Reinforcement-learning-based training can be complex and computationally expensive.
Further research is needed on the transparency and explainability of JudgeLRM's judgment criteria.
Generalization across a wider range of domains and evaluation tasks still needs to be verified.