Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Created by
  • Haebom

Author

Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang

Outline

This paper proposes domain-aware self-consistent policy optimization (DISCO) to address limitations of Group Relative Policy Optimization (GRPO). GRPO, a reinforcement learning approach used in RLHF-style training, performs well without learning a value function, but when applied to imbalanced multi-domain data, such as real-world datasets, it biases learning toward dominant domains. DISCO addresses this through two mechanisms: domain-aware reward adjustment and difficulty-aware reward adjustment. Domain-aware reward adjustment rescales rewards according to domain frequency, while difficulty-aware reward adjustment uses prompt-level self-consistency to prioritize uncertain prompts, promoting fairer and more effective policy learning. Experiments show that DISCO outperforms existing GRPO variants by 5% across various LLMs and imbalanced datasets, and achieves state-of-the-art results on multi-domain alignment benchmarks.
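
A minimal sketch of the idea in Python, assuming inverse-frequency domain weights and self-consistency-based difficulty weights applied on top of GRPO-style group-relative advantages. The function names (`group_relative_advantages`, `domain_weights`, `difficulty_weights`) and the exact weighting formulas are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, group_ids):
    # GRPO-style advantage: standardize rewards within each group of
    # responses sampled for the same prompt (no learned value function).
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    adv = np.zeros_like(rewards)
    for g in np.unique(group_ids):
        mask = group_ids == g
        mu, sigma = rewards[mask].mean(), rewards[mask].std() + 1e-8
        adv[mask] = (rewards[mask] - mu) / sigma
    return adv

def domain_weights(domains):
    # Assumed inverse-frequency weighting: samples from rare domains get
    # larger weights so dominant domains do not dominate the update.
    domains = np.asarray(domains)
    counts = {d: int((domains == d).sum()) for d in np.unique(domains)}
    total, n_domains = len(domains), len(counts)
    return np.array([total / (n_domains * counts[d]) for d in domains])

def difficulty_weights(self_consistency):
    # Assumed difficulty proxy: self_consistency is the fraction of sampled
    # answers that agree with the majority answer for a prompt (in [0, 1]).
    # Lower consistency -> more uncertain prompt -> larger weight.
    sc = np.asarray(self_consistency, dtype=float)
    return 1.0 - sc + 1e-6  # keep weights strictly positive

# Toy batch: two math prompts and one code prompt, two samples each,
# so the "math" domain dominates the batch.
rewards          = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
group_ids        = [0, 0, 1, 1, 2, 2]
domains          = ["math"] * 4 + ["code"] * 2
self_consistency = [0.9, 0.9, 0.8, 0.8, 0.3, 0.3]

adv = group_relative_advantages(rewards, group_ids)
weighted_adv = adv * domain_weights(domains) * difficulty_weights(self_consistency)
print(weighted_adv)  # weighted advantages that would enter the policy-gradient loss
```

In this sketch the weights are applied to the group-relative advantages rather than to the raw rewards, since a per-group multiplicative factor on rewards would cancel out during within-group standardization.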

Takeaways, Limitations

Takeaways:
Provides an effective solution to the LLM alignment problem on imbalanced multi-domain data
Overcomes GRPO's limitations and achieves measurable performance improvements
Demonstrates the effectiveness of domain- and difficulty-aware reward adjustment strategies
Achieves new state-of-the-art performance on multi-domain alignment benchmarks
Supports reproducibility and follow-up research through open code and data
Limitations:
Further verification of the generalization performance of the proposed method is needed.
Extensive experiments are needed on various types of imbalanced datasets.
Further comparative analysis with other RLHF methods is needed.
Analysis of the subjectivity of domain and difficulty definitions, and of its impact on results, is needed.