Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Created by
  • Haebom

Author

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He

Outline

In this paper, the authors propose α-DPO, a novel adaptive preference optimization algorithm for aligning large language models (LLMs) with human values and intentions, designed to overcome the computational efficiency and training stability limitations of reinforcement learning from human feedback (RLHF). α-DPO introduces a dynamic reward margin that reduces dependence on an optimal reference model and mitigates suboptimal decisions under diverse data settings. It achieves personalized reward margins by balancing the policy model and the reference model through an adaptive preference distribution. Through theoretical guarantees and experimental evaluations on AlpacaEval 2 and Arena-Hard, the authors show that α-DPO outperforms DPO and SimPO, establishing it as a powerful tool for LLM alignment.
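The summary above does not reproduce the paper's exact objective, but a minimal PyTorch sketch of the general idea (a DPO-style pairwise loss whose target margin adapts per example based on the reference model's preference gap) might look like the following. All names here (adaptive_margin_dpo_loss, beta, alpha, the *_logps inputs) are illustrative assumptions, not the authors' API, and the margin formula is one plausible instantiation rather than the paper's formulation.

```python
# Illustrative sketch, not the paper's exact objective: a DPO-style
# pairwise loss where the constant target margin is replaced by a
# per-example margin derived from the reference model's own
# preference gap, scaled by a hyperparameter alpha.
import torch
import torch.nn.functional as F

def adaptive_margin_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 2.5,                    # inverse-temperature on the gap
    alpha: float = 0.5,                   # strength of the adaptive margin
) -> torch.Tensor:
    # Policy's preference gap for each chosen/rejected pair.
    policy_gap = policy_chosen_logps - policy_rejected_logps
    # Reference model's preference gap; detached so the margin acts as
    # a per-example target rather than a gradient path.
    ref_gap = (ref_chosen_logps - ref_rejected_logps).detach()
    # Adaptive margin: grows with how strongly the reference model
    # already separates the pair (an assumed instantiation).
    margin = alpha * ref_gap
    # Standard Bradley-Terry-style logistic loss on the margined gap.
    return -F.logsigmoid(beta * (policy_gap - margin)).mean()
```

In this sketch, alpha controls how strongly the reference model's confidence shapes the per-example margin; setting alpha = 0 recovers a plain margin-free DPO-style loss.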

Takeaways, Limitations

Takeaways:
A novel adaptive preference optimization algorithm (α-DPO) is presented to address the efficiency and stability issues of RLHF.
Improved performance over existing methods (DPO, SimPO) through a dynamic reward margin.
Proving the superiority of α-DPO through theoretical guarantees and experimental results.
Significant contributions to the field of LLM alignment.
Reproducibility achieved through public code.
Limitations:
Further analysis of the algorithm's complexity and computational cost is needed.
Generalization performance needs to be verified on various LLM architectures and datasets.
Further studies are needed on long-term safety and potential side effects.