Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HPS: Hard Preference Sampling for Human Preference Alignment

Created by
  • Haebom

Author

Xiandong Zou, Wanyu Lin, Yuchen Li, Pan Zhou

Outline

In this paper, we propose a novel framework, Hard Preference Sampling (HPS), for aligning the responses of large language models (LLMs) with human preferences. Existing preference optimization methods based on the Plackett-Luce (PL) and Bradley-Terry (BT) models have shortcomings such as difficulty in handling harmful content, inefficient use of dispreferred responses, and, in the PL case, high computational cost. HPS addresses these problems by introducing a training loss that prioritizes the most preferred response and rejects all dispreferred and harmful responses. In particular, it strengthens the model's rejection ability by emphasizing "hard" dispreferred responses that are similar to the preferred one, and it reduces computational overhead while maintaining alignment quality by using a single-sample Monte Carlo sampling strategy. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring a clearer distinction. Experiments on the HH-RLHF and PKU-Safety datasets verify the effectiveness of HPS: it achieves comparable BLEU and reward scores while significantly improving the reward margin, thereby reducing the generation of harmful content.
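To make the sampling idea more concrete, below is a minimal PyTorch sketch of what a hard-preference-sampling-style objective could look like. It is an illustrative assumption, not the authors' exact formulation: the function name `hps_style_loss`, its arguments, the softmax hardness weighting, and the logistic margin surrogate are all stand-ins chosen to mirror the two ideas described above (emphasizing hard dispreferred responses and using a single Monte Carlo sample instead of summing over all of them).

```python
import torch
import torch.nn.functional as F

def hps_style_loss(reward_chosen: torch.Tensor,
                   rewards_rejected: torch.Tensor,
                   beta: float = 1.0) -> torch.Tensor:
    """Illustrative HPS-style loss sketch (hypothetical, not the paper's exact objective).

    reward_chosen:    (batch,)   scores for the single most-preferred response
    rewards_rejected: (batch, k) scores for k dispreferred/harmful responses
    beta:             temperature controlling how sharply "hard" rejected
                      responses (scores close to the chosen one) are emphasized
    """
    # Hardness weights: rejected responses with higher reward (closer to the
    # chosen response) receive exponentially larger weight, i.e. hard-negative emphasis.
    hardness = F.softmax(beta * rewards_rejected, dim=-1)          # (batch, k)

    # Single-sample Monte Carlo: draw one rejected response per example
    # from the hardness distribution instead of summing over all k.
    idx = torch.multinomial(hardness, num_samples=1)               # (batch, 1)
    hard_rejected = rewards_rejected.gather(-1, idx).squeeze(-1)   # (batch,)

    # Push the chosen response's reward above the sampled hard rejected one
    # via a Bradley-Terry-style logistic surrogate on the reward margin.
    margin = reward_chosen - hard_rejected
    return -F.logsigmoid(margin).mean()
```

The single `torch.multinomial` draw is what keeps the per-step cost independent of the number of dispreferred responses, which is the computational benefit the summary attributes to the single-sample Monte Carlo strategy.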

Takeaways, Limitations

Takeaways:
Effectively addresses the problems of harmful content handling, inefficient use of dispreferred responses, and high computational cost in existing preference optimization methods.
Improves computational efficiency through a single-sample Monte Carlo sampling strategy.
Maximizes the reward margin between preferred and dispreferred responses, enabling a clearer distinction.
Experiments on the HH-RLHF and PKU-Safety datasets confirm reduced generation of harmful content alongside competitive BLEU and reward scores.
Limitations:
Additional experiments and analysis are needed to establish the general performance and limitations of HPS as presented in this paper.
Further research is needed on how well HPS applies and generalizes to other types of LLMs and datasets.
A more detailed explanation and analysis of how HPS defines and selects "hard" dispreferred responses is needed.