Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

General Exploratory Bonus for Optimistic Exploration in RLHF

Created by
  • Haebom

Author

Wendi Li, Changdae Oh, Sharon Li

Outline

This paper addresses optimistic exploration, a key issue for improving sample efficiency in reinforcement learning from human feedback (RLHF). The authors analyze why existing exploration bonus methods fail to achieve optimism under KL or α-divergence regularization: such regularization biases exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior. To address this, they propose the General Exploratory Bonus (GEB), a novel theoretical framework that satisfies the optimism principle. GEB counteracts the divergence-induced bias through reference-dependent reward adjustments, recovers existing heuristic bonuses as special cases, and extends naturally across the entire α-divergence family. Experiments show that GEB consistently outperforms baselines across divergence settings and on alignment tasks with multiple large language model backbones, suggesting that GEB is a principled and practical solution for optimistic exploration in RLHF.
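To make the bias described above concrete, the sketch below contrasts a standard KL-regularized RLHF objective, whose penalty pulls the policy toward reference-likely responses, with a hypothetical reference-dependent bonus that restores optimism by rewarding responses the reference model deems unlikely. The function names, the coefficient `bonus_coef`, and the bonus form `-bonus_coef * logp_ref` are illustrative assumptions for this toy example, not the paper's actual GEB formulation.

```python
def kl_regularized_objective(reward, logp_policy, logp_ref, beta):
    # Standard RLHF objective: reward minus a KL-style penalty toward the
    # reference model. The penalty term grows when the policy assigns
    # probability mass to responses the reference model finds unlikely,
    # which biases exploration toward reference-likely regions.
    return reward - beta * (logp_policy - logp_ref)

def with_reference_bonus(reward, logp_policy, logp_ref, beta, bonus_coef):
    # Hypothetical reference-dependent adjustment (illustrative only, NOT
    # the paper's GEB formula): add an optimism bonus proportional to
    # -log pi_ref, offsetting the conservatism of the KL penalty.
    bonus = -bonus_coef * logp_ref
    return kl_regularized_objective(reward, logp_policy, logp_ref, beta) + bonus

# Two candidate responses with equal reward and policy log-prob, but one is
# far less likely under the reference model (logp_ref = -5.0 vs. -1.0).
common = kl_regularized_objective(1.0, -2.0, -1.0, beta=0.1)
rare = kl_regularized_objective(1.0, -2.0, -5.0, beta=0.1)
print(common > rare)  # True: the plain KL objective prefers the common response

common_b = with_reference_bonus(1.0, -2.0, -1.0, beta=0.1, bonus_coef=0.2)
rare_b = with_reference_bonus(1.0, -2.0, -5.0, beta=0.1, bonus_coef=0.2)
print(rare_b > common_b)  # True: the bonus flips the preference toward the rare response
```

The point of the toy comparison is only the sign of the effect: under the plain KL objective the reference-likely response scores higher, while a sufficiently large reference-dependent bonus reverses that ordering.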

Takeaways, Limitations

Takeaways:
Presents GEB, a novel exploration bonus framework for improving sample efficiency in RLHF.
Theoretically analyzes the conservatism bias caused by divergence-based regularization in existing exploration bonuses and proposes a solution.
Demonstrates GEB's superior performance across a variety of divergence settings and large language model backbones.
Shows that GEB unifies existing exploration bonuses as special cases and extends across the α-divergence family.
Limitations:
The summary lacks specific numerical results and details of the experimental setup.
Little information is given about the hyperparameters required to implement and tune GEB.
The approach may be limited to certain types of RLHF tasks.
The summary does not cover the specific proofs behind the theoretical analysis.