Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DCPO: Dynamic Clipping Policy Optimization

Created by
  • Haebom

Authors

Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin

Outline

This paper proposes Dynamic Clipping Policy Optimization (DCPO), a novel framework for improving the reasoning capability of large language models through reinforcement learning. To address the zero-gradient problem of the existing GRPO method, it introduces a dynamic clipping strategy based on token-level prior probabilities and a smooth advantage normalization technique computed over cumulative training steps. DCPO achieves state-of-the-art performance on four benchmarks across four different models, outperforming the existing GRPO, DAPO, and GSPO methods, particularly on the AIME24 and AIME25 benchmarks. It also improves the non-zero-gradient ratio by an average of 28% relative to GRPO, doubles training efficiency relative to DAPO, and significantly reduces the token clipping rate.
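The dynamic clipping idea can be pictured as a GRPO-style clipped surrogate whose clipping bounds vary per token instead of being a single fixed constant. The sketch below is a minimal PyTorch illustration; the specific widening rule (looser bounds for tokens the old policy assigned low probability), the function name, and the arguments are assumptions for illustration, not the exact formulation from the paper.

```python
import torch

def dcpo_style_loss(logp_new, logp_old, advantages, eps_base=0.2):
    """GRPO-style clipped surrogate with hypothetical per-token clipping bounds.

    logp_new, logp_old: (batch, seq) log-probabilities of the sampled tokens
    under the current and old policies; advantages: (batch, seq) token-level
    advantages. The bound-widening rule below is an illustrative assumption.
    """
    ratio = torch.exp(logp_new - logp_old)   # per-token importance ratio
    p_old = torch.exp(logp_old)              # token-level prior probability
    eps = eps_base * (2.0 - p_old)           # assumed rule: looser bound for rare tokens
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Standard clipped policy-gradient objective, averaged over tokens.
    loss = -torch.minimum(ratio * advantages, clipped_ratio * advantages)
    return loss.mean()
```

With a fixed bound, tokens whose ratio already sits outside the clip range contribute zero gradient; letting the bound depend on the token's prior probability is one way such tokens can keep contributing, which is the intuition behind the reported increase in the non-zero-gradient ratio.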

Takeaways, Limitations

Takeaways:
Presents a novel method that effectively addresses the zero-gradient problem in reinforcement learning for large language models.
Makes more efficient use of generated data through the dynamic clipping strategy and smooth advantage normalization technique (a rough sketch of the latter follows this list).
Achieves superior performance over existing methods on various benchmarks.
Improves training efficiency and reduces the token clipping rate.
Limitations:
Further research is needed to determine the generalization performance of the proposed method.
Additional experiments with various models and benchmarks are needed.
Further research is needed on tuning the parameters of the dynamic clipping strategy.
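The smooth advantage normalization mentioned above can be read as standardizing rewards with statistics smoothed over the cumulative training history rather than a single step's batch. Below is a minimal PyTorch sketch under that reading; the class name and the step-count weighting scheme are assumptions for illustration and do not reproduce the paper's exact formula.

```python
import torch

class SmoothAdvantageNormalizer:
    """Running normalizer that blends the current step's reward statistics with
    statistics accumulated over all previous training steps. The step-count
    weighting used here is an illustrative assumption, not the paper's formula.
    """

    def __init__(self, eps: float = 1e-8):
        self.eps = eps
        self.step = 0
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, rewards: torch.Tensor) -> torch.Tensor:
        self.step += 1
        w = 1.0 / self.step                               # weight on the newest batch
        self.mean = (1.0 - w) * self.mean + w * rewards.mean().item()
        self.var = (1.0 - w) * self.var + w * rewards.var(unbiased=False).item()
        # Advantages: rewards standardized with the smoothed statistics.
        return (rewards - self.mean) / (self.var ** 0.5 + self.eps)
```

A call like `normalizer(group_rewards)` would then return advantages standardized against the smoothed statistics rather than against the current group alone, which is one way to reduce the variance of per-step normalization.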