Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

VL Norm: Rethink Loss Aggregation in RLVR

Created by
  • Haebom

Authors

Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu

Outline

This paper aims to improve the reasoning performance of large language models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR), where dynamically varying generation lengths lead to high gradient variance and unstable training. To address this, the authors propose Variance-reduced Length-dependent Normalization (VL Norm), a loss aggregation scheme designed to yield an unbiased gradient estimate with minimal variance. Despite its simple implementation, VL Norm overcomes the limitations of existing aggregation methods and performs well across a range of experiments; in particular, when integrated into the DAPO algorithm, it achieves up to 2.67x faster convergence on the CountDown task.
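The summary does not spell out the exact normalization formula, so the sketch below is only a minimal PyTorch-style illustration of where a length-dependent loss aggregation would sit in a GRPO/DAPO-style trainer. The two baseline schemes (per-sequence mean and global token mean) are common choices in the RLVR literature; the inverse-length weight in the final branch is a hypothetical placeholder standing in for VL Norm's actual weighting, not the paper's formula.

import torch

def aggregate_loss(token_loss: torch.Tensor, mask: torch.Tensor, scheme: str = "length_weighted") -> torch.Tensor:
    """Aggregate per-token policy-gradient losses over responses of varying length.

    token_loss: (B, T) per-token losses (e.g. advantage-weighted log-prob terms
                in a GRPO/DAPO-style objective).
    mask:       (B, T) with 1 for generated tokens and 0 for padding.
    """
    lengths = mask.sum(dim=1).clamp(min=1)          # L_i for each response

    if scheme == "seq_mean":
        # Mean over tokens within each response, then mean over responses.
        per_seq = (token_loss * mask).sum(dim=1) / lengths
        return per_seq.mean()

    if scheme == "token_mean":
        # One mean over all generated tokens in the batch (token-level aggregation).
        return (token_loss * mask).sum() / mask.sum()

    # Length-dependent normalization in the spirit of VL Norm: reweight each
    # response by a function of its length so the aggregated gradient stays
    # unbiased while its variance shrinks. The inverse-length weight below is
    # an assumed placeholder, NOT the formula from the paper.
    weights = 1.0 / lengths.float()                 # assumed w(L_i) = 1 / L_i
    per_seq = (token_loss * mask).sum(dim=1)
    return (weights * per_seq).sum() / weights.sum()

Swapping one aggregation branch for another in an existing trainer is the kind of change the summary describes as requiring fewer than 10 modified lines of code.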

Takeaways, Limitations

Takeaways:
  • Presents a novel loss aggregation method that addresses training instability caused by dynamic generation lengths in RLVR environments.
  • Improves learning efficiency through unbiased estimation and minimized gradient variance.
  • Highly accessible: the implementation requires fewer than 10 lines of code changes.
  • Shows consistent performance improvements across a variety of model sizes, maximum generation lengths, and tasks.
  • Demonstrates gains when integrated with a state-of-the-art RL algorithm (DAPO).
Limitations:
  • Limited detail about the specific experimental setup and datasets.
  • Further analysis of VL Norm's generalization performance is needed.
  • Applicability and effectiveness in other RLVR environments and tasks require further study.