Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Created by
  • Haebom

Authors

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye

Outline

This paper addresses the challenge of obtaining high-quality training data for fine-tuning language models under an annotation budget. Specifically, the authors experimentally study how to prioritize data of different difficulty levels when fine-tuning with Group Relative Policy Optimization (GRPO) across a range of model sizes and families. Using difficulty estimates obtained from multi-sample evaluations of the base model, they compare four subset selection policies (easy, medium, hard, and random) drawn from the same unlabeled data pool. The experiments show that training on the hardest examples yields performance gains of up to 47%, while the easiest examples yield the smallest gains, likely because hard examples provide more learning opportunities during GRPO training. The paper concludes with practical guidance: when using GRPO on reasoning tasks under a constrained budget, prioritizing hard examples can substantially improve performance.
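For intuition, the sketch below (a minimal illustration, not the authors' code) walks through the pipeline the summary describes: difficulty is estimated as the base model's failure rate over multiple samples per prompt, a budget-sized subset is then drawn under one of the four selection policies, and a GRPO-style group-relative advantage shows why examples the model already solves every time contribute no learning signal. `sample_answer` and `is_correct` are hypothetical stand-ins for your own model-sampling and answer-checking routines.

```python
import random
from typing import Callable, List

def estimate_difficulty(
    prompts: List[str],
    sample_answer: Callable[[str], str],     # hypothetical: draw one answer from the base model
    is_correct: Callable[[str, str], bool],  # hypothetical: check an answer against the reference
    k: int = 8,
) -> List[float]:
    """Difficulty = failure rate of the base model over k samples per prompt."""
    difficulties = []
    for prompt in prompts:
        passes = sum(is_correct(prompt, sample_answer(prompt)) for _ in range(k))
        difficulties.append(1.0 - passes / k)
    return difficulties

def select_subset(prompts: List[str], difficulties: List[float],
                  budget: int, policy: str = "hard") -> List[str]:
    """Pick a budget-sized training subset under one of the four selection policies."""
    ranked = sorted(zip(prompts, difficulties), key=lambda pair: pair[1])
    if policy == "easy":          # lowest difficulty
        chosen = ranked[:budget]
    elif policy == "hard":        # highest difficulty
        chosen = ranked[-budget:]
    elif policy == "medium":      # window centered on the median difficulty
        start = max(0, len(ranked) // 2 - budget // 2)
        chosen = ranked[start:start + budget]
    elif policy == "random":      # difficulty-agnostic baseline
        chosen = random.sample(ranked, budget)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [prompt for prompt, _ in chosen]

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: each sampled completion's reward, centered on the
    group mean and scaled by the group std. If every completion in the group
    earns the same reward (an example the model always solves, or always
    fails), all advantages are zero and the example contributes no gradient
    signal -- one intuition for why hard-but-solvable examples help most."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Note that in this sketch the "hard" policy simply takes the highest-failure-rate prompts; whether the paper applies any further filtering (e.g., excluding examples the base model never solves) is not stated in the summary.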

Takeaways, Limitations

Takeaways: When fine-tuning language models under a limited annotation budget, prioritizing hard examples is the most effective data selection strategy for improving performance, and the effect is especially pronounced with GRPO. This offers practical guidance for data acquisition strategies in real-world applications.
Limitations: The study covers only GRPO, so further research is needed to determine whether the findings generalize to other fine-tuning techniques. The difficulty-estimation method has its own limitations, and because the results come from specific models and tasks, their generalizability to other models, datasets, and tasks remains to be verified.