Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of the paper belongs to the author and the relevant institution. When sharing, simply cite the source.

The Hidden Link Between RLHF and Contrastive Learning

Created by
  • Haebom

Author

Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen

Mutual Information Optimization (MIO)

Outline

This paper presents a methodology for aligning large language models (LLMs) with human values. Specifically, it reinterprets the representative alignment methods, Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), from the perspective of mutual information (MI) maximization. This reinterpretation reveals their connection to contrastive learning and shows that they implicitly rely on the Donsker-Varadhan (DV) lower bound on MI, the bound underlying the MINE estimator. Building on this view, the authors propose Mutual Information Optimization (MIO), which replaces the DV/MINE lower bound with the Jensen-Shannon (JS) MI estimator. Through theoretical analysis and experiments, they show that MIO mitigates the late-stage performance degradation seen in DPO and achieves competitive performance on various reasoning and mathematical benchmarks.
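The two MI lower bounds the paper contrasts can be sketched as follows. This is an illustrative NumPy implementation of the generic DV/MINE and JS estimators in their standard forms, not the paper's MIO objective itself; `t_joint` and `t_marginal` are assumed to be critic scores on samples from the joint distribution and the product of marginals, respectively.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def dv_lower_bound(t_joint, t_marginal):
    # Donsker-Varadhan / MINE bound: E_p[T] - log E_q[exp(T)].
    # The log-sum-exp term can grow unboundedly with the critic's outputs,
    # which is the instability the paper associates with DV/MINE.
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

def js_lower_bound(t_joint, t_marginal):
    # Jensen-Shannon MI estimator (f-GAN form):
    # E_p[-softplus(-T)] - E_q[softplus(T)].
    # Both terms saturate, so the objective stays bounded.
    return (-softplus(-t_joint)).mean() - softplus(t_marginal).mean()

# Toy critic scores: joint pairs score higher than marginal pairs.
t_j = np.array([2.0, 1.0, 3.0])
t_m = np.array([-1.0, 0.0, -2.0])
print(dv_lower_bound(t_j, t_m))  # positive when the critic separates the two
print(js_lower_bound(t_j, t_m))  # bounded above by 0 in this parameterization
```

Both estimators increase as the critic better separates joint from marginal samples; the practical difference is that the JS form saturates while the DV form's `log E[exp(T)]` term does not, which is the stability argument behind swapping the bound.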

Takeaways, Limitations

Takeaways:
A new perspective is presented by interpreting RLHF and DPO through MI maximization and contrastive learning.
MIO is proposed to address the shortcomings of DPO and achieves better performance.
The validity of the methodology is supported by theoretical analysis and extensive experiments.
Limitations:
Although MIO's advantages are demonstrated through comparison with existing methods such as DPO, fundamental limitations may still exist.
Further research is needed to determine whether the performance gains on specific benchmarks generalize to all types of LLM alignment problems.
Information on the computational cost and resources required to implement and train MIO is not specifically provided.