Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Created by
  • Haebom

Authors

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He

Outline

This paper demonstrates the natural emergence of long chain-of-thought (CoT) reasoning through a simple reinforcement learning (RL) framework that uses rule-based rewards. It applies the zero RL training approach of DeepSeek-R1 to a variety of base models. Unlike previous studies that focused mainly on the Qwen2.5 model, the authors perform zero RL training on ten different base models, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, and Qwen2.5-Math-7B. Strategies such as format reward adjustment and query difficulty control significantly improve reasoning accuracy and response length in most settings. However, monitoring training dynamics reveals that different base models exhibit distinct learning patterns; for example, an increase in response length does not always correlate with the emergence of specific cognitive behaviors such as verification. Notably, the authors observe "aha moments" for the first time in small models outside the Qwen family. They share the key design choices, research findings, and practical lessons that enable successful zero RL training, and they open-source the code, models, and analysis tools.
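To make the "rule-based rewards" idea concrete, below is a minimal sketch (not the authors' code) of how such a reward can be computed: the reward checks only whether the response follows the expected answer format and whether the extracted final answer matches the reference, with no learned reward model. The function names, the \boxed{} answer format, and the specific reward values are illustrative assumptions, not taken from the paper.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Assumes the model is instructed to put its final answer inside \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Scalar reward from simple rules: format check plus exact-match correctness."""
    answer = extract_final_answer(response)
    if answer is None:
        return -1.0   # format penalty: no parseable final answer (illustrative value)
    if answer == reference_answer.strip():
        return 1.0    # correct final answer
    return -0.5       # well-formatted but incorrect answer (illustrative value)

if __name__ == "__main__":
    print(rule_based_reward("... so the result is \\boxed{42}.", "42"))  # 1.0
    print(rule_based_reward("I think the answer is 42.", "42"))          # -1.0
```

In this sketch, "format reward adjustment" would amount to tuning the penalty given to ill-formatted responses, and "query difficulty control" would correspond to selecting training queries whose observed pass rate falls in a useful range.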

Takeaways, Limitations

Takeaways:
We validate the effectiveness of zero RL training on various base models and present key design strategies for successful training.
Format reward adjustment and query difficulty control improve reasoning accuracy and response length.
We also observe "aha moments" in small models outside the Qwen family, showing that zero RL training applies across diverse model architectures.
We support further research by open-sourcing our code, models, and analysis tools.
Limitations:
The understanding of training dynamics remains incomplete: increases in response length do not consistently correlate with the emergence of cognitive behaviors.
Despite the diversity of base models used, biases toward certain model families may still exist.
Clear criteria for defining and measuring an "aha moment" may be needed.