Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, please cite the source.

What Matters in Data for DPO?

Created by
  • Haebom

Author

Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, Chonghuan Wang

Outline

Direct Preference Optimization (DPO) is a simple and effective approach for aligning large language models (LLMs) with human preferences without training a separate reward model. This study systematically examines which characteristics of preference data matter most for DPO performance. We show that the quality of the chosen responses plays a crucial role in optimizing the DPO objective, while the quality of the rejected responses has a comparatively limited impact. With respect to the chosen responses, online DPO behaves similarly to supervised learning, and experiments across a range of tasks show that improving chosen-response quality consistently improves performance.
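For context, the DPO objective discussed above is a pairwise logistic loss over the chosen and rejected responses. The sketch below is a minimal, generic implementation of that loss, assuming summed per-sequence log-probabilities are precomputed; the function and argument names are illustrative and not taken from the paper.

```python
# Minimal sketch of the standard DPO loss, assuming summed per-sequence
# log-probabilities from the policy and a frozen reference model.
# All names here are illustrative, not from the paper or a specific library.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The loss widens the margin between chosen and rejected implicit rewards,
    # which is why the quality of the chosen response enters the objective directly.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In this form, the chosen and rejected responses enter the objective symmetrically, which is what makes the paper's asymmetric finding, that chosen-response quality matters far more than rejected-response quality, noteworthy.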

Takeaways, Limitations

The quality of the chosen responses in the preference data has the most significant impact on DPO performance.
The quality of the rejected responses has relatively little impact on DPO performance.
With respect to the chosen responses, online DPO behaves similarly to supervised learning (see the gradient sketch after this list).
Improving the quality of the chosen responses consistently improves performance across a variety of tasks.
The study also investigates the benefits of blending in on-policy data.
The proposals are validated through extensive experiments.
(Limitations are not specified in the paper.)
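The takeaway that DPO behaves like supervised learning on the chosen responses can be made concrete via the well-known gradient decomposition of the DPO loss: the gradient with respect to the chosen log-probability is the supervised fine-tuning direction scaled by a sigmoid weight. The numerical check below illustrates that relationship under made-up values; it is a sketch, not the paper's derivation.

```python
# Numerical sketch: the DPO gradient on the chosen log-probability equals the
# SFT gradient direction (-1) scaled by a sigmoid weight. Values are made up.
import torch
import torch.nn.functional as F

beta = 0.1
policy_chosen = torch.tensor(-12.0, requires_grad=True)    # log pi_theta(y_chosen | x)
policy_rejected = torch.tensor(-15.0, requires_grad=True)  # log pi_theta(y_rejected | x)
ref_chosen, ref_rejected = torch.tensor(-13.0), torch.tensor(-14.0)

margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
loss = -F.logsigmoid(margin)
loss.backward()

weight = beta * torch.sigmoid(-margin)  # data-dependent scaling in (0, beta)
print(policy_chosen.grad)               # DPO gradient on the chosen log-prob
print(-weight)                          # same value: SFT direction (-1) times the weight
```

The sigmoid weight shrinks as the policy already prefers the chosen response, so DPO effectively performs weighted supervised learning on the chosen side, which is consistent with the finding that improving chosen-response quality translates directly into better performance.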