Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

GTPO: Trajectory-Based Policy Optimization in Large Language Models

Created by
  • Haebom

Author

Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino

Outline

This paper identifies two key problems with existing Group-relative Policy Optimization (GRPO): (i) conflicting gradient updates that arise when the same token receives both positive and negative rewards across completions, and (ii) negatively rewarded completions penalizing confident responses and shifting model decisions toward less probable tokens, which flattens the output distribution and impedes learning. To address these issues, the paper proposes Group-relative Trajectory-based Policy Optimization (GTPO), which identifies conflicting tokens, amplifies their positive updates, and skips their negative ones. GTPO also prevents policy collapse by filtering out completions whose entropy exceeds a threshold. Unlike GRPO, GTPO does not rely on KL-divergence regularization, so no reference model is needed during training. Experiments on the GSM8K, MATH, and AIME 2024 benchmarks demonstrate improved performance and stability.
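The mechanics described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `gtpo_advantages`, the 1.5 amplification factor, the per-position definition of a "conflicting" token, and the use of a per-completion mean entropy are all assumptions made for illustration.

```python
import numpy as np

def gtpo_advantages(token_ids, rewards, entropies, tau=2.5):
    """Hypothetical sketch of GTPO-style token-level advantage shaping.

    token_ids: list of token-id sequences, one per sampled completion
    rewards:   per-completion scalar rewards
    entropies: per-completion mean output entropy (assumed statistic)
    tau:       entropy threshold above which a completion is filtered out
    """
    rewards = np.asarray(rewards, dtype=float)
    keep = np.asarray(entropies) <= tau            # entropy filtering
    adv = rewards - rewards[keep].mean()           # group-relative baseline
    out = []
    for i, seq in enumerate(token_ids):
        if not keep[i]:
            out.append(np.zeros(len(seq)))         # filtered: no update
            continue
        a = np.full(len(seq), adv[i])
        for t, tok in enumerate(seq):
            # A token is treated as "conflicting" here if, at the same
            # position, it also appears in a kept completion whose
            # advantage has the opposite sign (illustrative definition).
            conflict = any(
                keep[j] and t < len(token_ids[j]) and token_ids[j][t] == tok
                and adv[j] * adv[i] < 0
                for j in range(len(token_ids)) if j != i
            )
            if conflict:
                # amplify positive updates, skip negative ones
                a[t] = a[t] * 1.5 if adv[i] > 0 else 0.0
        out.append(a)
    return out
```

For example, with two completions that share their first token but receive opposite rewards, the shared token's positive advantage is amplified while its negative counterpart is zeroed out.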

Takeaways, Limitations

Takeaways:
  • Clarifies the limitations of GRPO and proposes GTPO, a new policy optimization method that addresses them.
  • GTPO achieves stable training and improved performance without KL-divergence regularization.
  • The superiority of GTPO is experimentally verified on the GSM8K, MATH, and AIME 2024 benchmarks.
  • Training without a reference model improves efficiency.
Limitations:
  • Further analysis and tuning of GTPO's entropy threshold setting is needed.
  • Additional experiments with other types of language models and benchmarks are needed.
  • A more detailed theoretical justification for the proposed entropy threshold is needed.