Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Created by
  • Haebom

Author

Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

Outline

This paper proposes Fair Sequence Policy Optimization (FSPO), a sequence-level reinforcement learning method for large language models (LLMs). FSPO applies length-fair clipping to the importance sampling (IS) weights to address the problem that conventional PPO/GRPO clipping, when applied at the sequence level, systematically reweights short and long responses and distorts the optimization direction. Concretely, FSPO clips the sequence log-IS ratio into a band whose width is proportional to $\sqrt{L}$, where $L$ is the response length. Theoretically, the paper formalizes length fairness via the length reweighting error (LRE) and proves that a small LRE keeps the clipped update directionally close (in cosine similarity) to the true update. Experimentally, FSPO flattens the clipping rate across length bins, stabilizes training, and outperforms all baselines on multiple evaluation datasets when training from the Qwen3-8B-Base model.
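The sketch below illustrates the core idea of clipping a sequence-level log-IS ratio into a band proportional to $\sqrt{L}$. It is a minimal assumption-laden illustration, not the paper's implementation: the function name, the hyperparameter `delta` (band half-width per $\sqrt{\text{token}}$), and the exact masking/summation conventions are hypothetical.

```python
import torch

def length_fair_clip_is_weight(logp_new, logp_old, response_mask, delta=0.2):
    """Hypothetical sketch of length-fair sequence-level IS clipping.

    Clips the sequence log importance-sampling ratio into a band whose
    half-width grows with sqrt(L), so that short and long responses are
    clipped at comparable rates. `delta` is an assumed hyperparameter
    name, not taken from the paper.
    """
    # Sequence-level log-IS ratio: sum of per-token log-prob differences
    # over the response tokens, i.e. log pi_new(y|x) - log pi_old(y|x).
    log_ratio = ((logp_new - logp_old) * response_mask).sum(dim=-1)
    # Response length L for each sequence in the batch.
    lengths = response_mask.sum(dim=-1).clamp(min=1.0)
    # Length-fair band: half-width proportional to sqrt(L).
    band = delta * lengths.sqrt()
    clipped_log_ratio = torch.minimum(torch.maximum(log_ratio, -band), band)
    # Return the clipped sequence-level IS weight.
    return clipped_log_ratio.exp()
```

In a PPO/GRPO-style objective, such a weight would multiply the sequence-level advantage; the key point is only that the clipping threshold scales with $\sqrt{L}$ rather than being constant across lengths.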

Takeaways, Limitations

Takeaways:
Proposes FSPO, a novel reinforcement learning method that addresses weight distortion caused by sequence length (see the sketch above).
Demonstrates the effectiveness of a length-fair clipping technique based on $\sqrt{L}$ scaling.
Supports the validity of the method with a theoretical analysis based on LRE.
Experimentally verifies performance improvements over existing methods on various evaluation datasets.
Contributes to improving the stability of LLM training.
Limitations:
The effectiveness of the proposed method may be limited to the specific LLM (Qwen3-8B-Base) and datasets tested.
Generalization to other types of LLMs or larger-scale models requires further study.
The optimal width of the $\sqrt{L}$-proportional clipping band may vary depending on the model and dataset.
Length fairness may need to be evaluated with indicators beyond LRE.