FlowRL: Matching Reward Distributions for LLM Reasoning
Created by
Haebom
Author
Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo
Outline
FlowRL proposes to match the full reward distribution via flow balancing in reinforcement learning (RL) for large language models (LLMs), rather than maximizing reward. Existing reward-maximizing methods tend to over-optimize dominant reward signals, ignoring less frequent but valid reasoning paths and reducing diversity. FlowRL transforms scalar rewards into a normalized target distribution using a learnable partition function, then minimizes the reverse KL divergence between the policy and this target distribution. The method is implemented as a flow-balanced optimization that encourages diverse exploration and generalizable reasoning trajectories. In experiments on mathematical and code reasoning tasks, FlowRL achieves an average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks.
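Below is a minimal, hypothetical PyTorch sketch of a flow-balance-style objective of the kind the summary describes: a learnable log-partition estimate log Z_phi(x) normalizes the scalar reward into a target distribution p(y|x) ∝ exp(β·r(x,y)), and the policy's trajectory log-probability is pushed toward that target through a squared residual. The function name, the β coefficient, and the omission of any reference-policy or length-normalization terms are illustrative assumptions, not the paper's exact loss.

```python
import torch

def flow_balance_loss(logprob_sum: torch.Tensor,
                      reward: torch.Tensor,
                      log_z: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Squared flow-balance residual (illustrative sketch, not the paper's exact objective).

    logprob_sum: sum of token log-probs of each sampled response under the policy, shape (B,)
    reward:      scalar reward for each response, shape (B,)
    log_z:       learnable estimate of log Z_phi(x) for each prompt, shape (B,)
    """
    # If log Z_phi(x) + log pi_theta(y|x) == beta * r(x, y) for every response y,
    # then pi_theta(y|x) is proportional to exp(beta * r(x, y)), i.e. the policy
    # matches the normalized target reward distribution instead of collapsing
    # onto the single highest-reward mode.
    residual = log_z + logprob_sum - beta * reward
    return (residual ** 2).mean()

# Example usage with dummy tensors (batch of 4 sampled responses):
B = 4
loss = flow_balance_loss(torch.randn(B), torch.rand(B), torch.zeros(B, requires_grad=True))
loss.backward()
```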
Takeaways, Limitations
•
Takeaways:
◦
Presents reward-distribution matching as a key step toward efficient exploration and diverse reasoning in LLM RL.
◦
Achieves higher performance than existing methods (GRPO, PPO) on mathematical and code reasoning tasks.
◦
Encourages diverse exploration and generalizable reasoning trajectories.
•
Limitations:
◦
The paper does not explicitly discuss its Limitations.