Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Process Reward Models That Think

Created by
  • Haebom

Author

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang

Outline

In this paper, we propose ThinkPRM, a process reward model (PRM) that verifies solutions step by step by generating a verification chain of thought (CoT), improving the data efficiency of process-level verifiers. ThinkPRM is fine-tuned on only about 1% of the process labels required by existing discriminative PRMs, yet, by leveraging the reasoning ability of long CoT models, it outperforms prior methods on several benchmarks, including ProcessBench, MATH-500, and AIME '24. It also surpasses existing PRMs in out-of-domain evaluations on subsets of GPQA-Diamond and LiveCodeBench, and it scales verification compute more efficiently than LLM-as-a-Judge under the same token budget. Overall, ThinkPRM demonstrates the value of generative, long CoT PRMs that can scale test-time verification compute while requiring only minimal supervision.
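To make the idea concrete, below is a minimal Python sketch (not the authors' code) of how a generative, long-CoT verifier can be used to rerank sampled solutions: the verifier is prompted to reason about each step and emit per-step judgments, which are then parsed and aggregated into a solution score. The `call_verifier_llm` stub, the prompt wording, and the "Step k: correct/incorrect" output format are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of best-of-N selection with a generative, long-CoT process verifier.
# The LLM call is stubbed out; the prompt and output format are assumptions.
import re
from typing import List

VERIFY_PROMPT = (
    "Problem:\n{problem}\n\n"
    "Candidate solution:\n{solution}\n\n"
    "Think step by step about whether each solution step is correct, then for every "
    "step output a line of the form 'Step k: correct' or 'Step k: incorrect'."
)

def call_verifier_llm(prompt: str) -> str:
    """Placeholder for a call to a long-CoT verifier model (API or local checkpoint).
    Returns the verification chain of thought as plain text."""
    raise NotImplementedError("plug in your own model call here")

def step_scores_from_cot(verification_cot: str, num_steps: int) -> List[float]:
    """Parse per-step judgments out of the verification CoT.
    Steps the verifier never mentions are conservatively scored 0."""
    scores = [0.0] * num_steps
    for match in re.finditer(r"Step\s+(\d+)\s*:\s*(correct|incorrect)",
                             verification_cot, flags=re.IGNORECASE):
        idx = int(match.group(1)) - 1
        if 0 <= idx < num_steps:
            scores[idx] = 1.0 if match.group(2).lower() == "correct" else 0.0
    return scores

def solution_score(problem: str, steps: List[str]) -> float:
    """Score a candidate solution with one generative verification pass,
    requiring every step to be judged correct."""
    prompt = VERIFY_PROMPT.format(problem=problem, solution="\n".join(steps))
    cot = call_verifier_llm(prompt)
    per_step = step_scores_from_cot(cot, len(steps))
    return min(per_step) if per_step else 0.0

def best_of_n(problem: str, candidates: List[List[str]]) -> List[str]:
    """Rerank N sampled solutions with the verifier and return the highest scoring one."""
    return max(candidates, key=lambda steps: solution_score(problem, steps))
```

Taking the minimum over step judgments is just one simple aggregation choice; multiplying per-step probabilities or using the last-step score are common alternatives in the PRM literature.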

Takeaways, Limitations

Takeaways:
Presents a method for training effective step-by-step verifiers with only a small amount of labeled data.
Improves verification performance by leveraging the reasoning ability of long CoT models.
Scales test-time verification compute more efficiently.
Outperforms existing methods across a range of benchmarks.
Limitations:
Further research is needed on the generalization performance of the proposed method.
Results are reported on specific benchmarks, so applicability to other problem domains still needs to be examined.
Experiments on larger datasets may be lacking.