This paper proposes Omni-DPO to improve the efficiency of Direct Preference Optimization (DPO) in reinforcement learning from human feedback (RLHF). Existing DPO approaches are limited by treating all preference pairs equally. Omni-DPO is a dual-perspective optimization framework that jointly considers the inherent quality of each preference pair and how well the model is currently learning it. By adaptively adjusting per-sample weights based on data quality and model learning dynamics, it uses training data more efficiently and improves performance. Experiments across various models and benchmarks demonstrate the superiority and generalization ability of Omni-DPO: on textual understanding tasks it outperforms Claude 3 Opus by 6.7 points on the Arena-Hard benchmark, and it consistently outperforms baseline methods on mathematical reasoning tasks.
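To make the dual-perspective weighting idea concrete, the sketch below shows one plausible form of a weighted DPO loss in which each preference pair's contribution is scaled by two factors: a data-quality weight (here a hypothetical `quality_scores` input, e.g. derived from an external reward-model margin) and a learning-dynamics weight that down-weights pairs the model already prefers correctly. The function name, weighting formulas, and hyperparameters are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def omni_dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps,
                        quality_scores, beta=0.1, gamma=2.0):
    """Sketch of a dual-perspective weighted DPO loss (assumed form).

    policy_*_logps / ref_*_logps: summed log-probabilities of the chosen and
    rejected responses under the policy and the frozen reference model
    (shape: [batch]).
    quality_scores: per-pair data-quality weights in [0, 1] (hypothetical,
    e.g. from an external reward-model margin).
    """
    # Standard DPO logits: implicit reward margin between chosen and rejected.
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratio - ref_logratio)

    # Probability that the model already prefers the chosen response.
    p_correct = torch.sigmoid(logits)

    # Learning-dynamics weight (focal-style): emphasize pairs the model still
    # gets wrong; gamma controls how strongly easy pairs are down-weighted.
    w_focal = (1.0 - p_correct).detach() ** gamma

    # Data-quality weight: trust higher-quality preference pairs more.
    w_quality = quality_scores.detach()

    # Per-sample DPO loss, rescaled by both weights.
    per_sample_loss = -F.logsigmoid(logits)
    return (w_quality * w_focal * per_sample_loss).mean()
```

Detaching both weights keeps them as pure reweighting factors, so gradients still flow only through the standard DPO term; the paper's released code should be consulted for the exact weighting functions.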
Takeaways, Limitations
• Takeaways:
◦ Addresses a key limitation of DPO, which treats all preference pairs equally and ignores their qualitative differences.
◦ Enables more efficient use of training data by accounting for both data quality and model learning dynamics.
◦ Outperforms existing methods on various benchmarks (notably, it surpasses Claude 3 Opus by a significant margin on the Arena-Hard benchmark).
◦ Achieves strong results on both textual understanding and mathematical reasoning tasks, demonstrating the effectiveness and robustness of the approach.
◦ Reproducibility is supported by publicly released code and models.
• Limitations:
◦ The paper does not explicitly discuss its own limitations; further research is needed to more broadly verify the applicability and limits of Omni-DPO.
◦ Because results are reported only on a specific set of benchmarks, further evaluation is needed to assess generalization to other domains and tasks.