This paper addresses the overfitting problem that arises when training a language model (LM) with reinforcement learning from human feedback (RLHF). RLHF is a method for training an LM to follow complex human preferences. The LM is first trained with supervised fine-tuning (SFT); response pairs are then sampled from this model and labeled with human preferences, and a reward model (RM) is trained on these comparisons. Finally, the LM is trained with RL to maximize the reward provided by the RM. As RL training progresses, the responses generated by the LM drift away from those seen during RM training; under this shift the RM becomes inaccurate and the policy overfits to its errors. This paper investigates the overfitting problem from a distribution-shift perspective and shows that the shift leads to inconsistent estimates of the RM parameters and, in turn, of the policy gradients. To address this, we propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively applies off-policy corrections to the RM using importance weighting. We show that OCRM yields more accurate RMs without requiring new labels or samples and, empirically, produces better final policies. Experiments on summarization and chatbot datasets demonstrate significant performance improvements over standard RLHF methods and baselines, and we release the implementation code.
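To make the idea of importance-weighted RM training concrete, the following is a minimal sketch (not the authors' released code) of a Bradley-Terry preference loss reweighted by the ratio of current-policy to data-generating-policy likelihoods; the names `r_chosen`, `logp_current`, `logp_behavior`, and the clipping constant are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an importance-weighted Bradley-Terry reward-model loss.
import torch
import torch.nn.functional as F

def iw_bradley_terry_loss(r_chosen, r_rejected, logp_current, logp_behavior, clip=10.0):
    """Importance-weighted pairwise preference loss.

    r_chosen, r_rejected: RM scores for the preferred / dispreferred response
        of each labeled pair (shape: [batch]).
    logp_current, logp_behavior: log-likelihoods of the responses under the
        current policy and under the policy that generated the labeled pairs
        (shape: [batch]); their ratio gives the importance weight.
    clip: cap on the importance weights to limit variance (assumed choice).
    """
    # w_i = pi_current(y_i | x_i) / pi_behavior(y_i | x_i), clipped for stability.
    weights = torch.exp(logp_current - logp_behavior).clamp(max=clip)
    # Standard Bradley-Terry negative log-likelihood per preference pair.
    nll = -F.logsigmoid(r_chosen - r_rejected)
    # Reweight each pair so the effective training distribution better matches
    # the responses the current policy would generate.
    return (weights * nll).sum() / weights.sum()
```

Clipping and self-normalizing the weights are common variance-reduction choices in importance sampling; the paper's exact correction and its iterative application may differ.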