Daily Arxiv

This page collects and organizes artificial intelligence papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

On Predictability of Reinforcement Learning Dynamics for Large Language Models

Created by
  • Haebom

Authors

Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, Junfeng Fang

Outline

Reinforcement learning (RL) substantially improves the reasoning performance of large language models (LLMs), but the parameter dynamics underlying RL training remain poorly understood. This study identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) rank-1 dominance, where the top singular subspace of the parameter update matrix almost completely determines the reasoning improvement, recovering over 99% of the performance gain; and (2) rank-1 linear dynamics, where this dominant subspace evolves linearly over training, so the final update can be predicted accurately from early checkpoints. Extensive experiments on eight LLMs and seven RL algorithms validate the generality of these properties. Building on these findings, the authors propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update from a short early training window. It achieves up to a 2.5x speedup while retaining over 96% of the reasoning performance, without additional modules or hyperparameter tuning, making it a versatile and practical tool for large-scale RL and opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
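As a concrete illustration of the rank-1 dominance idea, here is a minimal sketch (not the authors' code; all tensors are random stand-ins for real checkpoint weights): take the difference between RL-finetuned and base weights, keep only the top singular direction via SVD, and apply that component alone.

```python
# Minimal sketch of rank-1 truncation of an RL parameter update.
# All matrices here are synthetic stand-ins for real checkpoint weights.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 64, 128

W_base = rng.standard_normal((d_out, d_in))          # pre-RL weights (stand-in)
delta_W = 0.01 * rng.standard_normal((d_out, d_in))  # RL-induced update (stand-in)
W_rl = W_base + delta_W

# Rank-1 truncation of the update: Delta_1 = sigma_1 * u_1 * v_1^T
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
delta_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])

# The paper's claim is that W_base + delta_rank1 recovers >99% of the
# reasoning gain of W_rl on real checkpoints; with random matrices that
# ratio is meaningless, so we only report the update energy rank 1 captures.
energy = (S[0] ** 2) / np.sum(S ** 2)
print(f"fraction of update energy in rank-1 component: {energy:.3f}")

W_approx = W_base + delta_rank1  # weights with only the rank-1 update applied
```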

Takeaways, Limitations

Takeaways:
  • Identifies two fundamental properties of RL-based parameter updates in LLMs: rank-1 dominance and rank-1 linear dynamics.
  • Proposes AlphaRL, a framework built on these findings that speeds up training by up to 2.5x (a sketch of the extrapolation idea follows this list).
  • AlphaRL retains over 96% of reasoning performance without additional modules or hyperparameter tuning.
  • Contributes to a more interpretable, efficient, and principled approach to LLM training.
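To make the rank-1 linear dynamics property concrete, below is a minimal sketch (an assumption-laden illustration, not AlphaRL itself): fit a line to the magnitude of the rank-1 update component across a few early checkpoints, then extrapolate it to the final training step. The checkpoint updates here are synthetic and follow an exact linear trend by construction.

```python
# Minimal sketch of linear extrapolation of a rank-1 update component.
# Checkpoint data below is synthetic; real AlphaRL operates on actual
# training checkpoints.
import numpy as np

steps = np.array([100, 200, 300, 400])       # early checkpoint steps (hypothetical)
final_step = 1000                            # target training step (hypothetical)
rng = np.random.default_rng(1)

direction = rng.standard_normal((64, 128))   # fixed rank-1 direction (stand-in)
direction /= np.linalg.norm(direction)

# Synthetic rank-1 updates whose magnitude grows linearly with the step.
deltas = [0.002 * s * direction for s in steps]

# Least-squares line fit of magnitude vs. step, then extrapolation.
mags = np.array([np.linalg.norm(d) for d in deltas])
slope, intercept = np.polyfit(steps, mags, deg=1)
predicted_mag = slope * final_step + intercept

predicted_delta = predicted_mag * direction  # extrapolated final update
print(f"predicted final update magnitude: {predicted_mag:.3f}")
```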
Limitations:
  • Generalizability to other RL algorithms or LLM architectures requires further study.
  • Further validation is needed to confirm that AlphaRL delivers consistent gains across all LLM and RL settings.
  • A deeper understanding of why rank-1 dominance and rank-1 linear dynamics arise is still needed.