Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL

Created by
  • Haebom

Authors

Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu

FSPO: Sequence-Level Reinforcement Learning with Length-Fair Importance Sampling Weight Clipping

Outline

This paper proposes Fair Sequence Policy Optimization (FSPO), a sequence-level reinforcement learning method for large language models (LLMs) that applies length-fair clipping to importance sampling (IS) weights. Studying RL methods that use sequence-level IS, the authors find that when PPO/GRPO-style clipping is applied directly to sequences, a fixed clip range systematically reweights short and long responses, distorting the optimization direction. FSPO proposes a simple remedy: clip the sequence log-IS ratio within a band that scales with $\sqrt{L}$, where $L$ is the response length. Theoretically, the paper formalizes length fairness via the Length Reweighting Error (LRE) and proves that a small LRE yields a cosine directional guarantee between the clipped and true updates. Empirically, FSPO evens out the clip rate across length bins, stabilizes training, and outperforms baselines across model sizes and evaluation datasets, with the largest gains on the Qwen3-8B-Base model.
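
To make the clipping rule concrete, below is a minimal PyTorch sketch of a sequence-level PPO-style loss whose clip band on the log-IS ratio scales with $\sqrt{L}$, as described above. The function name `fspo_clipped_loss`, the tensor shapes, and the hyperparameter `epsilon` are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def fspo_clipped_loss(logp_new, logp_old, advantages, response_mask, epsilon=0.1):
    """Sequence-level PPO-style loss with a clip band that scales with sqrt(L).

    logp_new, logp_old : (batch, seq_len) per-token log-probs under the new / old policy
    advantages         : (batch,) sequence-level advantages
    response_mask      : (batch, seq_len) 1 for response tokens, 0 elsewhere
    epsilon            : band width per sqrt(length) unit (assumed hyperparameter)
    """
    mask = response_mask.float()

    # Sequence log-IS ratio: sum of per-token log-ratios over response tokens.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1)   # (batch,)
    lengths = mask.sum(dim=-1).clamp(min=1.0)                 # (batch,)

    # Length-fair band: clip the log-ratio to [-eps*sqrt(L), +eps*sqrt(L)] so that
    # short and long responses are clipped at comparable rates.
    band = epsilon * lengths.sqrt()
    clipped_log_ratio = torch.clamp(log_ratio, min=-band, max=band)

    ratio = log_ratio.exp()
    clipped_ratio = clipped_log_ratio.exp()

    # Pessimistic PPO-style surrogate at the sequence level; negate to get a loss.
    surrogate = torch.minimum(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()
```

Intuitively, because the band on the summed log-ratio grows like $\sqrt{L}$, the drift tolerated per token shrinks roughly as $\epsilon/\sqrt{L}$; a fixed band, by contrast, tolerates only $\epsilon/L$ per token and therefore clips long responses far more aggressively than short ones, which is the length bias the paper identifies.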

Takeaways, Limitations

Takeaways:
Presents a novel methodology for sequence-level reinforcement learning in LLMs.
Improves training stability and performance through length-fair clipping.
Establishes the validity of the methodology through theoretical analysis (the Length Reweighting Error bound).
Demonstrates performance above baselines across a variety of model sizes and evaluation datasets.
Limitations:
The abstract does not discuss specific limitations.