
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Created by
  • Haebom

Author

Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Outline

This paper identifies three limitations of reinforcement learning (RL) that relies solely on numerical feedback (performance plateaus, limited effectiveness of self-reflection, and persistent failures) and proposes Critique-GRPO, a novel RL framework that integrates natural language critiques to overcome them. Critique-GRPO optimizes the policy using numerical feedback and natural language critiques simultaneously, and in particular applies a shaping function that amplifies the reward for correct answers and penalizes incorrect ones. Experiments with the Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B models show that Critique-GRPO outperforms conventional supervised fine-tuning and RL-based fine-tuning methods on eight diverse reasoning tasks, and that it is especially effective for self-improvement through self-critique and for weak-to-strong generalization.
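The paper does not spell out the shaping function here, but the idea of shaped rewards feeding GRPO-style group-normalized advantages can be sketched as follows. The `bonus` and `penalty` values and function names are hypothetical illustrations, not the paper's actual formulation:

```python
import math

def shaped_reward(is_correct: bool, base_reward: float,
                  bonus: float = 0.5, penalty: float = 0.5) -> float:
    """Hypothetical shaping function: amplify the reward for a correct
    answer and push an incorrect answer below its base reward."""
    return base_reward + bonus if is_correct else base_reward - penalty

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: subtract the group mean
    and divide by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

# A group of sampled responses for one prompt: (correct?, base reward)
group = [(True, 1.0), (False, 0.0), (False, 0.0), (True, 1.0)]
rewards = [shaped_reward(c, r) for c, r in group]
advantages = grpo_advantages(rewards)
```

Shaping widens the gap between correct and incorrect responses before normalization, so the policy gradient pushes harder toward correct answers; in Critique-GRPO the natural language critiques additionally guide which refined responses enter such groups.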

Takeaways, Limitations

Takeaways:
Demonstrates that integrating natural language critiques can address the limits of reinforcement learning for large language models driven by numerical feedback alone.
Critique-GRPO outperforms existing RL methods, and is particularly effective at improving performance through self-critique and at improving generalization.
Shows improved performance across diverse types of reasoning problems (mathematics, STEM, general reasoning).
Limitations:
Critique-GRPO is proposed as a solution to the three identified limitations (performance plateaus, limited effectiveness of self-reflection, and persistent failures); other types of limitations are not considered.
Its effectiveness may be limited to certain models and tasks; additional experiments across a broader range of models and tasks are needed.
Performance likely depends heavily on the quality of the natural language critiques and may degrade when critique quality is poor.