Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please cite the source when sharing.

Differential Information Distribution: A Bayesian Perspective on Direct Preference Optimization

Created by
  • Haebom

Author

Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo

Outline

This paper approaches Direct Preference Optimization (DPO) from a Bayesian perspective, interpreting training as learning the differential information needed to update the reference policy into the target policy. To formalize this, the authors introduce the Differential Information Distribution (DID) and show that DPO's log-ratio reward is justified through the DID. They also analyze how properties of the DID shape DPO's training dynamics and downstream performance.
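For context, the log-ratio reward referred to above is the implicit reward in the standard DPO objective. The following is a minimal sketch (not the authors' code) of that objective, written to make the log-ratio term log π_θ(y|x) − log π_ref(y|x) explicit; tensor shapes and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding log pi(y|x)
    summed over the response tokens of the chosen / rejected answer.
    """
    # Implicit log-ratio rewards: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry style preference loss: push the chosen reward above the rejected one.
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
```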

Takeaways, Limitations

DPO's log-ratio reward is justified when it encodes the differential information needed to update the reference policy into the target policy.
Commonly observed DPO training dynamics (e.g., the evolution of log-likelihoods and policy exploration) arise from the power-law structure of the DID.
The entropy of the DID is a predictor of downstream performance: high-entropy DIDs suit open-ended instruction following, while low-entropy DIDs suit knowledge-based question answering (see the sketch after this list).
While the paper provides a theoretical foundation for DPO and offers practical guidelines, its generalizability may be limited by the lack of detailed experimental results and of validation across diverse datasets.
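To make the entropy claim concrete, here is a hypothetical sketch of estimating the entropy of a DID over a small, enumerated set of candidate responses. It assumes the reading that the DID is proportional to the ratio π_θ(y|x)/π_ref(y|x); the paper's exact definition and estimator may differ, and the function and variable names are illustrative.

```python
import torch

def did_entropy(policy_logps, ref_logps):
    """Entropy of a DID estimated over an enumerated candidate set.

    policy_logps, ref_logps: (num_candidates,) tensors of log pi(y|x)
    for a fixed prompt x, one entry per candidate response y.
    """
    # Unnormalized log-DID is the log-ratio; normalize over the candidate set.
    log_did = torch.log_softmax(policy_logps - ref_logps, dim=-1)
    did = log_did.exp()
    # Shannon entropy H(DID) = -sum_y p(y) log p(y).
    return -(did * log_did).sum()
```

Under this reading, a flatter log-ratio across candidates yields higher entropy (broad, instruction-following-style updates), while a sharply peaked ratio yields lower entropy (narrow, knowledge-style updates).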