Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GTPO: Trajectory-Based Policy Optimization in Large Language Models

Created by
  • Haebom

Authors

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Outline

This paper identifies and analyzes two major problems with Group Relative Policy Optimization (GRPO): (i) tokens that frequently appear in both positively and negatively rewarded completions receive conflicting gradient updates, which lowers their output probabilities, and (ii) negatively rewarded completions penalize confident responses and push the model toward unlikely tokens, flattening the output distribution and impairing learning. To address these issues, the paper proposes Group-relative Trajectory-based Policy Optimization (GTPO). GTPO identifies conflict tokens, i.e., tokens that co-occur in completions with opposing rewards, and protects them by amplifying their positive updates while skipping the negative ones. To further prevent policy collapse, GTPO filters out completions whose entropy exceeds a provable threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, so no reference model is needed during training. Experiments on the GSM8K, MATH, and AIME 2024 benchmarks show that GTPO delivers greater training stability and improved performance.
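The mechanism described above can be made concrete with a minimal sketch. The code below shows, under stated assumptions, how conflict-token protection and entropy filtering might be wired into a group update; all names and constants (Completion, gtpo_token_weights, ENTROPY_THRESHOLD, POSITIVE_BOOST) are illustrative and do not come from the paper, whose exact weighting scheme and threshold derivation differ.

```python
# Hypothetical sketch of GTPO-style conflict-token protection and entropy
# filtering. Names and constants are assumptions for illustration only.

from dataclasses import dataclass
from typing import List

ENTROPY_THRESHOLD = 2.5   # assumed value; the paper derives a provable threshold
POSITIVE_BOOST = 2.0      # assumed amplification factor for protected tokens


@dataclass
class Completion:
    token_ids: List[int]   # sampled token ids for this completion
    advantage: float       # group-relative advantage (reward minus group mean)
    mean_entropy: float    # average policy entropy over the completion


def gtpo_token_weights(group: List[Completion]) -> List[List[float]]:
    """Return per-token weights for the policy-gradient update of one group.

    1. Drop completions whose entropy exceeds the threshold (collapse guard).
    2. Find "conflict" tokens that occur in both positively and negatively
       rewarded completions.
    3. For conflict tokens: amplify positive updates, skip negative ones.
    """
    kept = [c for c in group if c.mean_entropy <= ENTROPY_THRESHOLD]

    pos_tokens = {t for c in kept if c.advantage > 0 for t in c.token_ids}
    neg_tokens = {t for c in kept if c.advantage < 0 for t in c.token_ids}
    conflict = pos_tokens & neg_tokens

    weights = []
    for c in kept:
        row = []
        for t in c.token_ids:
            if t in conflict:
                # Protected token: boost the update if the completion is
                # positively rewarded, zero out the penalty otherwise.
                row.append(POSITIVE_BOOST if c.advantage > 0 else 0.0)
            else:
                row.append(1.0)
        weights.append(row)
    return weights
```

In such a scheme, each weight would multiply the token's advantage-weighted log-probability term in the policy-gradient loss; consistent with the summary above, no KL penalty or reference model appears anywhere in the update.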

Takeaways and Limitations

Takeaways:
  • The paper identifies GRPO's limitations and proposes GTPO, a new policy optimization method that addresses them.
  • GTPO simplifies training and improves efficiency by eliminating KL-divergence regularization and the reference model it requires.
  • GTPO's superior performance is verified experimentally on the GSM8K, MATH, and AIME 2024 benchmarks.
  • The method offers a more stable and effective strategy for training and aligning large language models.
Limitations:
  • GTPO's entropy threshold may require further analysis and tuning.
  • Further research is needed on the generality of the method and its applicability to other model architectures.
  • Experimental results are limited to a narrow set of benchmarks; performance on other tasks and datasets requires further validation.