Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Created by
  • Haebom

Author

Johannes Ackermann, Takashi Ishida, Masashi Sugiyama

Outline

This paper addresses the reward overoptimization problem that arises when training a language model (LM) with reinforcement learning from human feedback (RLHF). RLHF trains an LM to follow complex human preferences: the LM is first trained with supervised fine-tuning, response pairs are then sampled from it and labeled with human feedback, and a reward model (RM) is trained on these preference data. The LM is subsequently trained with RL to maximize the reward assigned by the RM. As training progresses, however, the responses generated by the LM drift away from those seen during RM training, so the RM becomes inaccurate and the policy over-optimizes against it. The paper studies this problem from a distribution-shift perspective and shows that the shift yields an inconsistent estimate of the RM parameters and, in turn, an inconsistent estimate of the policy gradient. To address this, the authors propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively corrects the RM using importance weighting, without requiring new labels or samples. The corrected RM is more accurate and empirically leads to an improved final policy. Experiments on summarization and chatbot datasets show significant performance improvements over existing RLHF methods and baselines, and the implementation code is publicly released.
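To make the correction concrete, below is a minimal sketch (not the authors' released implementation) of an importance-weighted Bradley-Terry loss for retraining an RM on existing preference data: each pair is reweighted by the ratio of the current policy's likelihood to the data-collection policy's likelihood, so the RM remains accurate on the distribution the policy actually samples from. The function name, tensor inputs, and clipping constant are illustrative assumptions.

```python
# Minimal sketch: importance-weighted Bradley-Terry loss for correcting a
# reward model trained on off-policy preference data. All names below are
# hypothetical; this is not the paper's released code.

import torch
import torch.nn.functional as F

def iw_bradley_terry_loss(reward_chosen, reward_rejected,
                          new_logprob, old_logprob, clip=10.0):
    """Importance-weighted pairwise preference loss.

    reward_chosen / reward_rejected: RM scores for the preferred and
        dispreferred responses of each pair (shape [batch]).
    new_logprob / old_logprob: log-probabilities of the responses under the
        current policy and under the data-collection (SFT) policy, used to
        form the importance weights pi_new / pi_old.
    """
    # Importance weights correct the off-policy mismatch; clipping keeps
    # the variance of the weighted loss under control.
    log_ratio = new_logprob - old_logprob
    weights = torch.exp(log_ratio).clamp(max=clip).detach()

    # Standard Bradley-Terry negative log-likelihood per preference pair.
    nll = -F.logsigmoid(reward_chosen - reward_rejected)

    # Weighted average: pairs that are likely under the current policy
    # contribute more, so the retrained RM stays accurate on-policy.
    return (weights * nll).sum() / weights.sum()
```

Clipping (or self-normalizing) the importance weights is a standard way to control the variance that importance sampling introduces; the paper's actual correction procedure may differ in detail.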

Takeaways, Limitations

Takeaways:
A new distribution-shift interpretation of, and solution to, the overoptimization problem in RLHF
Improved performance over existing RLHF methods with the new OCRM method
Performance improvements without requiring new labels or samples
Validation on summarization and chatbot datasets
Improved accessibility through public release of the implementation code
Limitations:
Further research is needed on the generalization performance of the proposed method.
Additional experiments are needed on different types of language models and tasks.
Analysis of the computational cost and efficiency of OCRM is needed.
The method's dependence on the quality and quantity of human feedback needs to be assessed.