
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

Created by
  • Haebom

Author

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong

Outline

This paper analyzes why GRPO is effective among the reinforcement learning (RL) methods used to fine-tune large language models (LLMs) for complex reasoning tasks. Noting that the source of GRPO's success is not well understood despite its popularity, the authors reexamine it against RAFT, a simple rejection-sampling baseline that trains only on positively rewarded samples, and find that RAFT performs comparably to GRPO and PPO. Their experiments indicate that GRPO's main advantage comes not from reward normalization but from discarding prompts whose sampled responses are all incorrect. Based on this, they propose Reinforce-Rej, a simplified policy-gradient algorithm that filters out prompts with entirely incorrect or entirely correct responses; Reinforce-Rej improves KL efficiency and stability, while RAFT is presented as a robust and interpretable baseline. The authors suggest that future work should focus on more principled designs for incorporating negative samples rather than using them indiscriminately.
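The sample-filtering difference between the two methods can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation; the function and variable names (e.g. raft_filter, reinforce_rej_filter, the 0/1 rewards) are assumptions for the example.

```python
# Minimal sketch of the filtering rules described above (not the authors' code).
# `responses` are the sampled completions for one prompt; `rewards` are their
# assumed 0/1 correctness rewards.

def raft_filter(responses, rewards):
    """RAFT: keep only positively rewarded responses for fine-tuning."""
    return [r for r, rew in zip(responses, rewards) if rew > 0]

def reinforce_rej_filter(responses, rewards):
    """Reinforce-Rej: discard the whole prompt if its responses are all
    correct or all incorrect; otherwise keep every response for the
    policy-gradient update."""
    if all(rew > 0 for rew in rewards) or all(rew <= 0 for rew in rewards):
        return []  # prompt carries no learning signal under this rule
    return list(responses)

# Example: a prompt with mixed outcomes survives both filters,
# but RAFT keeps only the correct responses.
responses = ["cot_a", "cot_b", "cot_c", "cot_d"]
rewards = [1, 0, 1, 0]
print(raft_filter(responses, rewards))           # ['cot_a', 'cot_c']
print(reinforce_rej_filter(responses, rewards))  # all four responses kept
```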

Takeaways, Limitations

Takeaways:
GRPO's effectiveness is shown to come from discarding prompts with entirely incorrect responses, not from reward normalization.
RAFT and Reinforce-Rej are proposed as simple and efficient reinforcement learning algorithms.
A more principled use of negative samples is identified as a direction for future research.
RAFT is presented as a robust and interpretable baseline.
Limitations:
RAFT and Reinforce-Rej perform similarly to GRPO and PPO, so the absolute performance gain may be limited.
The paper does not offer a concrete methodology for using negative samples in a more principled way.