Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Created by
  • Haebom

Authors

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye

Example of the impact of difficulty on GRPO tuning effectiveness

Outline

Collecting high-quality training examples for fine-tuning language models is expensive, and budget constraints limit how much data can be acquired. This study investigates whether example difficulty affects GRPO training efficiency by comparing selection strategies (easy, medium, hard, and random) across multiple models and reasoning tasks. Training on the most difficult 10% of examples (those the base model fails on most often) yields large performance improvements of up to 47%, while training on easy examples yields minimal improvements of 3-15%. The reason is that GRPO needs variance in outcomes within each group of sampled outputs to produce a training signal: difficult examples maintain mixed success/failure outcomes throughout training, whereas easy examples quickly converge to consistent success, which eliminates the learning signal. Furthermore, models trained on difficult examples generalize better out of distribution, and only the models trained on difficult examples achieve meaningful gains on the AIME2025 benchmark.
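
The following is a minimal sketch (not the authors' code) of why GRPO needs mixed success/failure outcomes: advantages are computed relative to the group of sampled completions, so a group with identical rewards, as on an easy example the model always solves, produces zero advantage and therefore no gradient signal. The reward values below are assumed for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its group (GRPO-style normalization)."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Easy example: every rollout succeeds -> identical rewards, zero advantages,
# hence no gradient signal from this prompt.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# Hard example: mixed success/failure -> nonzero advantages, useful signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ≈ [+1.0, -1.0, -1.0, +1.0]
```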

Takeaways, Limitations

Under a fixed annotation budget, prioritize collecting and annotating the examples that the base model struggles with (see the selection sketch after this list).
Training on difficult examples provides nearly all of the learning value in GRPO tuning.
Difficult examples also improve out-of-distribution generalization.
The study's Limitations are not specified (not mentioned in the abstract).
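
As referenced in the first takeaway, below is a hedged sketch of how hard-example selection under an annotation budget could look. It assumes a candidate pool of {"prompt", "answer"} dicts and caller-supplied `sample_fn` / `check_fn` callables; the abstract only says the hardest 10% are the examples the base model fails on most often, so the exact procedure here is an assumption.

```python
def estimate_pass_rate(prompt, answer, sample_fn, check_fn, k=8):
    """Fraction of k sampled base-model completions judged correct for this prompt."""
    completions = [sample_fn(prompt) for _ in range(k)]
    return sum(check_fn(answer, c) for c in completions) / k

def select_hardest(pool, sample_fn, check_fn, budget_fraction=0.10, k=8):
    """Keep only the hardest `budget_fraction` of candidates (lowest base-model pass rate)."""
    scored = sorted(
        pool,
        key=lambda ex: estimate_pass_rate(ex["prompt"], ex["answer"], sample_fn, check_fn, k),
    )
    n_keep = max(1, int(len(scored) * budget_fraction))
    return scored[:n_keep]
```

In practice, `sample_fn` would draw a completion from the base model at a nonzero temperature and `check_fn` would verify the final answer, so the selected subset is exactly the set of prompts that still yield mixed outcomes during GRPO training.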