Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Created by
  • Haebom

Authors

Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

Outline

This paper addresses how to align the outputs of large language models (LLMs) with human preferences using reinforcement learning from human feedback (RLHF). Most existing RLHF algorithms learn a reward function using the Bradley-Terry model, but that model's assumptions about human preferences may not capture the complexity and variability of real-world judgments. The paper proposes a robust algorithm that maintains strong performance even when the reward model contains errors. Theoretically, the proposed algorithm reduces the variance of the reward and policy estimators, leading to improved regret bounds. Experimental evaluations on LLM benchmark datasets show that it consistently outperforms existing methods, with 77-81% of its responses preferred over the baseline algorithm on the Anthropic Helpful and Harmless dataset.
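For context, the sketch below shows the standard Bradley-Terry pairwise loss that conventional RLHF reward modeling relies on, and whose assumptions the paper identifies as a source of error; it does not reproduce the paper's robust estimator. The names (reward_model, chosen, rejected) are illustrative placeholders, not code from the paper.

```python
# Minimal sketch of the Bradley-Terry reward-modeling objective assumed by
# standard RLHF pipelines (not the paper's robust variant).
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen, rejected):
    # Scalar rewards for the preferred (chosen) and dispreferred (rejected) responses.
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry assumption: P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    # Maximizing its log-likelihood gives the usual pairwise loss below.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

When real human judgments deviate from this sigmoid-of-reward-difference assumption, the learned reward is misspecified; the paper's contribution is an algorithm that stays robust to such errors.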

Takeaways, Limitations

Takeaways:
Presents a novel robust algorithm that addresses the reward model error problem of existing RLHF algorithms.
Achieves improved performance by reducing the variance of the reward and policy estimators.
Shows superior experimental results compared to existing methods on the Anthropic Helpful and Harmless dataset.
Limitations:
Further research is needed on the generalization performance of the proposed algorithm.
Extensive experiments on various LLM benchmark datasets are required.
Performance evaluation in real application environments is required.