Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Created by
  • Haebom

Author

Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Outline

This paper identifies three major drawbacks of reinforcement learning (RL) that relies solely on numerical rewards: performance plateaus, the limited effectiveness of self-reflection, and persistent failures. To overcome these drawbacks, the authors propose Critique-GRPO, a novel RL framework that integrates natural language critiques. Critique-GRPO optimizes the policy using numerical and natural language feedback simultaneously, and employs a shaping function that reinforces correct corrections and penalizes incorrect ones. Experiments with the Qwen2.5 and Qwen3 models show that Critique-GRPO consistently outperforms supervised fine-tuning and RL-based fine-tuning baselines on eight challenging mathematics, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% (Qwen2.5-7B-Base) and 3.8% (Qwen3-8B). Self-improvement through self-criticism was particularly effective, achieving a +16.7% pass@1 improvement over GRPO on AIME 2024.
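The shaping idea described above can be sketched in a few lines. This is not the paper's actual formulation: the function names, the `bonus`/`penalty` weights, and the group-normalized advantage are illustrative assumptions meant only to convey how critique-guided corrections could be reinforced or penalized within a GRPO-style update.

```python
def shape_reward(base_reward: float, used_critique: bool, correct: bool,
                 bonus: float = 0.5, penalty: float = 0.5) -> float:
    """Sketch of the shaping idea: boost critique-guided refinements that
    reach a correct answer, penalize ones that remain incorrect.
    Weights are illustrative, not from the paper."""
    if not used_critique:
        return base_reward
    return base_reward + bonus if correct else base_reward - penalty

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize rewards within a sampled group
    by the group mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for a constant group
    return [(r - mean) / std for r in rewards]

# Example group: two critique-guided refinements (one fixed the answer,
# one did not) and two plain rollouts.
rewards = [
    shape_reward(1.0, used_critique=True, correct=True),    # 1.5
    shape_reward(0.0, used_critique=True, correct=False),   # -0.5
    shape_reward(1.0, used_critique=False, correct=True),   # 1.0
    shape_reward(0.0, used_critique=False, correct=False),  # 0.0
]
advantages = group_advantages(rewards)
```

Under this sketch, a refinement that turns a wrong answer into a correct one receives the largest advantage in its group, which is the reinforcement effect the shaping function is meant to produce.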

Takeaways, Limitations

Takeaways:
Demonstrates the value of natural language critiques for reinforcement learning of large language models in settings where numerical feedback alone falls short.
Critique-GRPO provides a novel reinforcement learning framework that effectively integrates numerical and natural language feedback to improve performance.
Shows the potential of self-improvement through self-criticism to further boost performance.
Verified superior performance over existing methods across diverse reasoning tasks.
Limitations:
Further research is needed on the generalization of the proposed method.
Scalability needs to be evaluated across LLMs of various sizes and architectures.
Dependence on critique quality, and robustness to low-quality critiques, remain to be evaluated.
Further research is needed on choosing optimal parameters for the shaping function.