Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Best of mini-N in-loop Sampling: A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Created by
  • Haebom

Authors

Hyung Gyu Rho, Sian Lee

Outline

This paper points out that existing preference alignment techniques, such as Best-of-N (BoN) sampling, only rank candidate responses and cannot judge whether any of them is actually acceptable, which can lead to the selection of an inappropriate option. To address this, the authors propose a reward model trained on preference data augmented with an outside option, an idea borrowed from discrete choice models. The resulting model can identify not only which response is better but also whether a response is good enough on its own. Building on this, the paper develops an adaptive inference strategy, best of mini-N in-loop, that balances reliability and efficiency: candidates are sampled in small batches and sampling stops early once an acceptable response is found. Experimental results show that the technique reduces reliability failures by 70% when used as an alignment guardrail and improves average inference speed by more than 22% when used as an inference accelerator.
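The core mechanism can be illustrated with a short sketch. This is not the authors' implementation: `generate`, `reward_model`, the threshold `tau` (standing in for the outside option's utility), and the batch/round parameters are all hypothetical placeholders, assuming a scalar contextual quality reward model.

```python
# Minimal sketch of the "best of mini-N in-loop" idea under the
# assumptions above. All names here are illustrative placeholders.
from typing import Callable, List, Tuple

def best_of_mini_n_in_loop(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # samples k responses for a prompt
    reward_model: Callable[[str, str], float],   # contextual quality score
    tau: float,                                  # acceptability threshold (outside option)
    mini_n: int = 4,                             # candidates per inner round
    max_rounds: int = 8,                         # cap on the total sampling budget
) -> Tuple[str, float]:
    """Sample candidates in mini-batches and stop as soon as one beats the
    outside option; otherwise return the best response seen overall."""
    best_resp, best_score = "", float("-inf")
    for _ in range(max_rounds):
        for resp in generate(prompt, mini_n):
            score = reward_model(prompt, resp)
            if score > best_score:
                best_resp, best_score = resp, score
        # Early exit: the current best is already "good enough" relative to
        # the outside option, so further sampling adds cost without benefit.
        if best_score >= tau:
            break
    return best_resp, best_score
```

The two reported roles fall out of the same loop: as a guardrail, the threshold keeps unacceptable responses from being returned merely because they ranked best; as an accelerator, early exits on easy prompts cut the average number of samples well below a fixed BoN budget.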

Takeaways, Limitations

Takeaways:
  • Presents a novel data collection and modeling framework that addresses the reliability issues of preference alignment models.
  • Overcomes a limitation of existing methods by explicitly modeling the acceptability of responses.
  • Provides a flexible framework for trading off reliability against efficiency.
  • Demonstrates improvements in both reliability and inference speed in an IMDB-sentiment setting.
Limitations:
  • Experimental results are reported only on a single dataset (IMDB-sentiment), limiting generalization.
  • Further research is needed to confirm whether the improvements carry over to other tasks and environments.
  • Little information is given about implementation details and the hyperparameter tuning process.
  • Potential increase in model complexity and computational cost.