Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

Created by
  • Haebom

Authors

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Outline

Klear-Reasoner is a model with long reasoning capability that deliberates carefully during problem solving and achieves outstanding performance across multiple benchmarks. Existing high-performing reasoning models are difficult to reproduce because their training details are only partially disclosed. This paper analyzes the full pipeline, from data preparation through long chain-of-thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL). Experiments on the SFT data show that a small number of high-quality data sources is more effective than a large number of diverse sources, and that keeping difficult samples without accuracy filtering yields better results. Furthermore, to address two key issues with existing RL clipping mechanisms (clipping suppresses important exploration signals and discards the learning signal from suboptimal trajectories), the paper proposes Gradient-Preserving clipping Policy Optimization (GPPO). GPPO gently backpropagates gradients from clipped tokens, enhancing the model's exploration ability and improving learning from negative samples. Klear-Reasoner demonstrates excellent reasoning in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6.
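
To make the gradient-preserving clipping idea concrete, below is a minimal PyTorch sketch. This is not the authors' released implementation; the function name `gppo_surrogate`, the epsilon defaults, and the exact stop-gradient construction are illustrative assumptions based on the description above.

```python
import torch

def gppo_surrogate(logp_new, logp_old, advantages,
                   eps_low: float = 0.2, eps_high: float = 0.2):
    """Gradient-preserving clipped surrogate, sketched after the GPPO idea.

    Standard PPO clipping zeroes the gradient for tokens whose importance
    ratio falls outside [1 - eps_low, 1 + eps_high]. Here the forward pass
    still uses the clipped value, but a stop-gradient trick lets a scaled
    gradient flow through those clipped tokens.
    """
    ratio = torch.exp(logp_new - logp_old)            # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)

    # Forward value equals the clipped ratio; in the backward pass the
    # gradient of `ratio` is rescaled by the detached factor clipped/ratio,
    # so clipped tokens contribute an attenuated gradient instead of none.
    soft_clipped = ratio * (clipped / ratio).detach()

    # Same pessimistic token-wise min as PPO.
    surrogate = torch.minimum(ratio * advantages, soft_clipped * advantages)
    return surrogate.mean()
```

Compared with standard PPO clipping, the only change in this sketch is the `.detach()`-based rescaling, which is what allows exploration signals and negative samples outside the clipping range to keep influencing the policy update.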

Takeaways, Limitations

Takeaways:
  • An effective SFT data strategy: a small amount of high-quality data outperforms a large amount of diverse data.
  • Evidence that difficult samples matter, even without accuracy filtering.
  • The GPPO algorithm, which addresses the problems of existing RL clipping mechanisms.
  • The Klear-Reasoner model, which demonstrates excellent performance on mathematical and programming problems.
Limitations:
  • Further verification of the generalization performance of the proposed methodology is needed.
  • Comparative analysis of GPPO against other RL algorithms is needed.
  • Further research is needed on the scalability and limits of the Klear-Reasoner model.