This page collects papers on artificial intelligence published around the world. Summaries are generated with Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.
Rethinking Reward Models for Multi-Domain Test-Time Scaling
Created by
Haebom
Author
Dong Bok Lee, Seanie Lee, Sangwoo Park, Minki Kang, Jinheon Baek, Dongki Kim, Dominik Wagner, Jiongdao Jin, Heejun Lee, Tobias Bocklet, Jinyu Wang, Jingjing Fu, Sung Ju Hwang, Jiang Bian, Lei Song
Outline
This paper re-examines the reward model (RM) variants used to verify the outputs of large language models (LLMs) during test-time scaling. Prior work has assumed that process reward models (PRMs), which assign a score to each intermediate reasoning step, outperform outcome reward models (ORMs), which evaluate only the final answer. This paper instead comprehensively evaluates four reward model variants (discriminative ORMs and PRMs, and generative ORMs and PRMs) across 14 diverse domains. We find that discriminative ORMs perform on par with discriminative PRMs, that generative PRMs are not competitive, and that generative ORMs are the most robust, achieving significant gains across all test domains. We attribute this to PRMs' step-level scores inheriting label noise from automatic LLM labeling, and to the difficulty of scoring long reasoning trajectories, including those involving self-correction.
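To make the ORM/PRM distinction concrete, here is a minimal sketch of best-of-N answer selection at test time. This is not the authors' code: `orm_score` and `prm_score` are hypothetical stand-ins for trained reward models, and min-aggregation over step scores is one common PRM convention, not necessarily the one used in the paper.

```python
from typing import Callable

def best_of_n_orm(candidates: list[str],
                  orm_score: Callable[[str], float]) -> str:
    """ORM: score each complete answer once, keep the highest-scoring one."""
    return max(candidates, key=orm_score)

def best_of_n_prm(candidates: list[list[str]],
                  prm_score: Callable[[str], float]) -> list[str]:
    """PRM: score every intermediate step of each (non-empty) trajectory,
    then aggregate the step scores (here: take the minimum) per trajectory."""
    return max(candidates, key=lambda steps: min(prm_score(s) for s in steps))
```

The aggregation rule (min, product, mean, last-step) is itself a design choice; under any of them, the PRM's final trajectory score depends on every step label, which is where step-level noise enters.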
Takeaways, Limitations
•
Takeaways:
◦
Fine-grained, step-level supervision does not always lead to better results; generative outcome verification can be an effective choice for multi-domain deployment.
◦
We show that the step-level scores of PRMs tend to accumulate errors, suggesting performance degradation as reasoning trajectories grow longer (a numerical sketch of this effect follows at the end of this section).
◦
We support future research by releasing the code, datasets, and checkpoints needed to compare and evaluate the performance of reward models across various domains.
•
Limitations:
◦
Information on the specific LLM architectures used, and on how performance varies with model size, is limited.
◦
Because the study covers only 14 domains, further verification is needed to establish generalizability to more diverse settings.
◦
Further in-depth analysis is needed to understand why generative ORMs perform better.
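As a back-of-the-envelope illustration of the error-accumulation point above (our own numerical sketch, not from the paper): if a step verifier labels each step correctly with probability p, then an n-step trajectory is scored without any step error with probability p^n, which decays quickly as n grows. The values of p and n below are hypothetical.

```python
# Hypothetical illustration of step-score noise compounding with trajectory length.
# With per-step labeling accuracy p, the probability that all n steps of a
# trajectory are labeled correctly is p**n.
for p in (0.99, 0.95, 0.90):
    for n in (10, 50, 100):
        print(f"p={p}, n={n}: clean-trajectory probability = {p ** n:.3f}")
```

Even at 95% per-step accuracy, a 50-step trajectory is scored cleanly less than 8% of the time, which is consistent with the paper's observation that step-level scoring struggles on long, self-correcting reasoning traces.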