Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Created by
  • Haebom

Author

Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

Sample Scheduling for Direct Preference Optimization (DPO)

Outline

This paper highlights that the performance of Direct Preference Optimization (DPO), which has emerged as an effective method for aligning large language models (LLMs) with human preferences, depends critically on the quality of the underlying preference data. While prior work has explored various data selection strategies, those approaches overlook the evolving state of the language model during optimization. The paper therefore formulates a new problem: sample scheduling for DPO, i.e., dynamically and adaptively scheduling training samples based on the model's batch-wise state throughout preference optimization. To address it, the authors propose SamS, an efficient algorithm that adaptively selects samples from each training batch based on learning feedback from the LLM, with the goal of maximizing generalization performance. Integrating SamS into DPO yields significant performance improvements across tasks without modifying the core DPO algorithm, while adding minimal computational overhead.
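The summary does not give SamS's exact scoring rule, but the batch-wise scheduling idea can be illustrated with a minimal sketch. Assumptions (not from the paper): per-sample feedback is approximated by the magnitude of the standard per-sample DPO loss, and the scheduler simply keeps the k highest-loss samples in each batch; the function names `dpo_loss` and `schedule_batch` are hypothetical.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-sample DPO loss: -log sigmoid(beta * implicit reward margin).

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def schedule_batch(samples, k):
    """Keep the k samples with the largest loss (a stand-in feedback signal).

    Each sample is a (pi_chosen, pi_rejected, ref_chosen, ref_rejected)
    tuple of log-probabilities; the real SamS algorithm may use a richer
    feedback signal than raw loss magnitude.
    """
    scored = sorted(samples, key=lambda s: dpo_loss(*s), reverse=True)
    return scored[:k]
```

Because selection reuses quantities already computed for the DPO loss, a scheduler of this shape adds little overhead beyond a per-batch sort, which is consistent with the paper's claim of minimal extra computation.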

Takeaways, Limitations

Takeaways:
  • A novel approach to improving DPO performance: batch-wise sample selection for better LLM alignment.
  • SamS, an effective sample scheduling algorithm that improves performance with minimal computational overhead.
  • Suggested generalizability to RLHF and a wider range of supervised learning paradigms.
Limitations:
  • The paper does not explicitly discuss its Limitations.