This paper addresses how to align the output of a large language model (LLM) with human preferences using reinforcement learning from human feedback (RLHF). Most existing RLHF algorithms learn reward functions using the Bradley-Terry model, but its assumptions about human preferences may not capture the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm that maintains performance despite errors in the learned reward model. Theoretically, the proposed algorithm reduces the variance of the reward and policy estimators, yielding improved regret bounds. Experimental evaluations on LLM benchmark datasets show that the proposed algorithm consistently outperforms existing methods, with 77-81% of its responses preferred over those of the baseline algorithm on the Anthropic Helpful and Harmless dataset.
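For context, the Bradley-Terry model referenced above assumes that pairwise preferences are governed by a single scalar reward function; in the illustrative notation below (not necessarily the paper's own), the probability that a response $y_w$ is preferred to $y_l$ for a prompt $x$ under a reward $r$ is
$$
P(y_w \succ y_l \mid x) \;=\; \frac{\exp\!\bigl(r(x, y_w)\bigr)}{\exp\!\bigl(r(x, y_w)\bigr) + \exp\!\bigl(r(x, y_l)\bigr)} \;=\; \sigma\!\bigl(r(x, y_w) - r(x, y_l)\bigr),
$$
where $\sigma$ denotes the logistic function. Reward models fit by maximizing the likelihood of observed preference pairs under this formula inherit its assumptions (a single scalar utility and consistent, transitive comparisons), which is the source of the misspecification concern the proposed robust algorithm targets.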