
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Learning to Reason at the Frontier of Learnability

Created by
  • Haebom

Authors

Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster

Outline

This paper points out an inefficiency in the reinforcement learning stage of large language model (LLM) training, particularly on reasoning tasks such as mathematical problems: under algorithms such as PPO and VinePPO, many training problems are either solved on every attempt (they have already been learned) or on none (they provide no meaningful training signal), so much of the training compute is wasted. To address this, the authors adapt the 'sampling for learnability' technique from the reinforcement learning literature to the reinforcement learning stage of LLM training, proposing a curriculum that preferentially trains on problems with high variance in success rate, i.e., problems the model sometimes solves but not reliably. Experimental results show that this curriculum consistently improves training performance across multiple algorithms and datasets.
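To make the sampling idea concrete, here is a minimal Python sketch of one plausible reading of the curriculum, assuming the learnability score is the Bernoulli variance p(1-p) of each problem's empirical success rate estimated from recent rollouts; the function names and data layout are hypothetical, not taken from the paper's code.

```python
import random

def learnability_score(successes: int, attempts: int) -> float:
    """Bernoulli variance p(1-p) of the empirical success rate.

    Problems solved always (p=1) or never (p=0) score 0; problems
    solved roughly half the time score the maximum of 0.25.
    """
    if attempts == 0:
        return 0.0
    p = successes / attempts
    return p * (1.0 - p)

def sample_training_batch(problem_stats: dict, batch_size: int) -> list:
    """Sample problem ids with probability proportional to p(1-p).

    problem_stats maps problem_id -> (successes, attempts), e.g.
    tallied from the policy's recent rollouts on each problem.
    """
    ids = list(problem_stats)
    weights = [learnability_score(*problem_stats[pid]) for pid in ids]
    if sum(weights) == 0:  # no informative problems yet: fall back to uniform
        return random.sample(ids, min(batch_size, len(ids)))
    return random.choices(ids, weights=weights, k=batch_size)

# Example: q1 (always solved) and q2 (never solved) get weight 0,
# so the batch is drawn from the sometimes-solved problems q3 and q4.
stats = {"q1": (8, 8), "q2": (0, 8), "q3": (4, 8), "q4": (6, 8)}
print(sample_training_batch(stats, batch_size=4))
```

Under this scoring, the two failure modes described above (always solved, never solved) receive zero sampling weight, which is exactly the behavior the curriculum is meant to induce.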

Takeaways, Limitations

Takeaways:
Presents a new curriculum learning method that improves the training efficiency of reinforcement learning for LLMs.
Maximizes learning efficiency by focusing training on problems with high variance in success rate.
Shows consistent performance improvements across a variety of algorithms and datasets.
Points toward more efficient and effective LLM reinforcement learning.
Limitations:
Further study is needed to determine whether the proposed method applies to all types of LLM training tasks.
The results may be limited to the specific algorithms and datasets evaluated.
Further research is needed on parameter tuning for the 'sampling for learnability' technique.
Applicability to other reinforcement learning algorithms remains to be evaluated.