Daily Arxiv

This page collects summaries of recently published papers on artificial intelligence from around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper remains with its authors and their institutions; please cite the original source when sharing.

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Created by
  • Haebom

Author

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang

Outline

This paper analyzes the limitations of Group Relative Policy Optimization (GRPO), a reinforcement learning method for strengthening large reasoning models (LRMs), and proposes Discriminative Constrained Optimization (DisCO), a novel framework that improves upon it. Grounded in discriminative learning principles, DisCO aims to eliminate question-level difficulty bias, ensure training stability, and address data imbalance. Experimental results show that DisCO outperforms GRPO and its recent variant DAPO in improving the mathematical reasoning ability of models initialized from supervised fine-tuning (SFT).
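To make the difficulty-bias point concrete, below is a minimal PyTorch sketch of the group-normalized advantage commonly attributed to GRPO. The estimator details in the paper may differ, and the reward values here are made up purely for illustration.

```python
import torch

def grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style group-relative advantage: standardize rewards
    # across a group of responses sampled for the same question.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# For very easy or very hard questions, almost all rewards in the
# group are identical, so the group std is tiny and the rare
# informative response receives an outsized advantage.
easy_q = torch.tensor([1., 1., 1., 1., 1., 1., 1., 0.])  # mostly correct
hard_q = torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.])  # mostly wrong

print(grpo_advantage(easy_q))
print(grpo_advantage(hard_q))
```

Note how the same binary reward yields advantages of very different magnitude depending on the question's pass rate; this question-level difficulty bias is what DisCO sets out to remove.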

Takeaways, Limitations

Takeaways:
DisCO addresses the question-level difficulty bias of GRPO and improves training stability.
Its discriminative learning formulation suggests a way to mitigate the data imbalance problem.
It showed superior performance to GRPO and DAPO in improving mathematical reasoning ability.
Limitations:
Experimental results are presented only for the 1.5B-scale model, so performance needs to be verified on models of other scales.
Additional detail is needed on the specific implementation of DisCO; the sketch after this list illustrates only the general idea.
Further validation is required of its applicability to other types of LRMs and to other tasks.
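Since the summary notes that implementation details are scarce, the following is only a speculative PyTorch sketch of what a discriminative constrained objective in the spirit of DisCO could look like: score correct generations above incorrect ones using a length-normalized log-likelihood-ratio score, and keep the policy close to the old policy via a squared-hinge penalty on an estimated KL divergence. All function names, the mean-difference surrogate, and the hyperparameter values are assumptions, not the authors' code.

```python
import torch

def sequence_score(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    # Length-normalized log-likelihood ratio of each generated sequence
    # under the current vs. the old policy (one row per sequence, one
    # column per token). An assumed scoring choice, not the paper's.
    return (logp_new - logp_old).mean(dim=-1)

def disco_style_loss(pos_scores: torch.Tensor,
                     neg_scores: torch.Tensor,
                     kl_estimate: torch.Tensor,
                     kl_budget: float = 0.01,
                     penalty: float = 10.0) -> torch.Tensor:
    # Discriminative term: raise scores of correct generations above
    # those of incorrect ones (mean-difference form; the paper may use
    # a different surrogate or weighting to handle data imbalance).
    margin = pos_scores.mean() - neg_scores.mean()

    # Constraint term: a squared-hinge penalty that activates only when
    # the estimated KL to the old policy exceeds the budget, one way to
    # promote training stability without PPO-style clipping.
    violation = torch.clamp(kl_estimate - kl_budget, min=0.0)
    return -margin + penalty * violation ** 2

# Toy usage with stand-in tensors:
pos = torch.randn(4, requires_grad=True)   # scores of correct generations
neg = torch.randn(12, requires_grad=True)  # scores of incorrect generations
loss = disco_style_loss(pos, neg, kl_estimate=torch.tensor(0.02))
loss.backward()
```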