This paper highlights the need for distributed learning to overcome the limits of single-datacenter computing, focusing on reinforcement learning (RL) post-training of large language models (LLMs). Conventional RL tightly couples the sampling and training cycle, which makes it hard to run in heterogeneous distributed environments; the authors therefore propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning. They identify that network delays cause the behavior policy to drift from the target policy, and the resulting KL divergence inflates the variance of importance weights until importance sampling fails. To address this, they propose the Group Expectation Policy Optimization (GEPO) algorithm, which reduces importance-weight variance through a group-expectation-based weighting mechanism. GEPO achieves exponential variance reduction in theory, and experiments show less than 3% performance degradation even under 1,800-second delays, with greater stability than existing methods such as GRPO. These results point to the strong potential of distributed RL over heterogeneous networks.
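To see why latency breaks importance sampling, note that the importance weight w = π_θ(y|x) / π_b(y|x) becomes heavy-tailed as the KL divergence between the target policy π_θ and the stale behavior policy π_b grows. The Python sketch below is a minimal toy illustration under assumed distributions; the `group_expectation_weights` function is one plausible reading of GEPO's group-expectation idea, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_weights(logp_target, logp_behavior):
    # w_i = pi_theta(y_i) / pi_b(y_i): per-sample importance weights.
    return np.exp(logp_target - logp_behavior)

def group_expectation_weights(logp_target, logp_behavior):
    # Hypothetical reading of GEPO's idea: divide by the group-level
    # expectation of the behavior policy's probability instead of each
    # sample's own, so a sample that is very unlikely under pi_b cannot
    # blow up its weight. The paper's exact estimator may differ.
    return np.exp(logp_target) / np.exp(logp_behavior).mean()

# As network delay grows, the behavior policy drifts from the target
# policy; we model this as growing noise between their log-probs.
for drift in (0.1, 0.5, 1.0, 2.0):
    logp_target = rng.normal(-10.0, 1.0, size=100_000)
    logp_behavior = logp_target - rng.normal(drift, drift, size=100_000)
    w_std = standard_weights(logp_target, logp_behavior)
    w_gep = group_expectation_weights(logp_target, logp_behavior)
    print(f"drift={drift:.1f}  Var(standard)={w_std.var():.2e}  "
          f"Var(group-expectation)={w_gep.var():.2e}")
```

In this toy run, the variance of the standard weights grows rapidly with the drift (i.e., with the divergence between the two policies), while the group-expectation variant stays bounded, mirroring the exponential variance reduction the paper claims.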
Takeaways and Limitations
• Takeaways:
◦ Presents an efficient RL-based post-training method for large language models in heterogeneous distributed environments.
◦ Proposes HeteroRL, an asynchronous RL architecture robust to network delays (see the sketch at the end of this summary), and GEPO, a policy-optimization algorithm that keeps importance sampling stable under delay.
◦ GEPO achieves exponential variance reduction in theory, and its stability has been verified experimentally.
◦ Opens up new possibilities for training and deploying large language models in distributed environments.
• Limitations:
◦ GEPO's performance gains may be limited to particular network environments or particular types of LLMs.
◦ Because of constraints in the experimental setup, generalization to real-world distributed environments still needs further verification.
◦ Further research is needed on HeteroRL's scalability and its applicability to other distributed learning settings.
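To make the architectural takeaway concrete, here is a minimal, hypothetical sketch of the decoupling that HeteroRL's description implies: rollout workers and the learner run asynchronously and exchange data through a queue, so network delay only makes samples stale rather than blocking parameter updates. The names and structure are illustrative assumptions, not the authors' implementation.

```python
import queue
import threading
import time

# Rollout workers push (sample, behavior-policy version) into a queue;
# the learner consumes whatever has arrived, however stale it is.
rollouts = queue.Queue(maxsize=1024)
param_version = 0  # stands in for the learner's current policy parameters

def rollout_worker(delay_s: float):
    """Generates samples using a copy of the policy that is stale by delay_s."""
    while True:
        stale_version = param_version   # snapshot before the network delay
        time.sleep(delay_s)             # simulated heterogeneous-network lag
        rollouts.put(("sample", stale_version))

def learner(steps: int):
    """Updates parameters from arriving samples without waiting on samplers."""
    global param_version
    for _ in range(steps):
        sample, behavior_version = rollouts.get()
        staleness = param_version - behavior_version
        # An off-policy correction (e.g. GEPO-style weighting) would go
        # here; the larger `staleness`, the further the behavior policy
        # has drifted from the current parameters.
        param_version += 1
        print(f"update {param_version}: sample was {staleness} versions old")

threading.Thread(target=rollout_worker, args=(0.05,), daemon=True).start()
learner(steps=5)
```

The key design point the paper's description suggests is that sampling never sits on the learner's critical path: delay shows up only as staleness in the consumed samples, which the importance-weighting scheme must then correct for.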