Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

MAPO: Mixed Advantage Policy Optimization

Created by
  • Haebom

Author

Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao

Outline

This paper addresses recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), which have significantly improved performance on reasoning tasks. In GRPO, the advantage function serves as the central mechanism for ranking trajectory importance. However, existing approaches suffer from the advantage reversion and advantage mirror problems, which hinder reasonable advantage allocation across different query samples. This paper proposes Mixed Advantage Policy Optimization (MAPO), a simple but effective GRPO strategy. Observing that trajectories appear with different degrees of certainty, the authors propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, they dynamically reweight the advantage function for samples with varying trajectory certainty, adaptively configuring it to account for sample-specific characteristics. Comparisons with related state-of-the-art methods and ablation studies on different advantage variants validate the effectiveness of the approach.
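To make the idea concrete, below is a minimal sketch, not the paper's implementation. Standard GRPO normalizes rewards within a rollout group by the group standard deviation; a percent-deviation variant instead measures deviation relative to the group mean; a certainty-derived weight then mixes the two. The exact formulas, the certainty estimate, and the linear mixing here are illustrative assumptions, and the helper names (`percent_deviation_advantage`, `mixed_advantage`) are hypothetical.

```python
import numpy as np

def grpo_advantage(rewards):
    """Standard GRPO advantage: z-score of rewards within a rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def percent_deviation_advantage(rewards):
    """Advantage percent deviation (assumed form): deviation relative to
    the group mean rather than the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (abs(r.mean()) + 1e-8)

def mixed_advantage(rewards, certainty):
    """MAPO-style mixing (sketch): interpolate between the two advantage
    forms with a weight driven by trajectory certainty in [0, 1].
    High certainty leans on percent deviation; low certainty on the z-score.
    The linear schedule is an assumption; the paper's may differ."""
    w = float(certainty)
    return w * percent_deviation_advantage(rewards) + (1 - w) * grpo_advantage(rewards)

# Example: six binary-reward rollouts for one query. Certainty is estimated
# here as the fraction of rollouts agreeing with the majority outcome
# (an assumption, not the paper's definition).
rewards = [1, 1, 1, 1, 0, 1]
certainty = max(np.mean(rewards), 1 - np.mean(rewards))  # 5/6 for this group
print(mixed_advantage(rewards, certainty))
```

The design intuition the sketch captures: when nearly all rollouts agree (high certainty), the group standard deviation shrinks and z-score advantages blow up, so a mean-relative deviation is better behaved; when outcomes are uncertain, the standard GRPO normalization remains informative.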

Takeaways, Limitations

Takeaways: MAPO is an effective strategy for improving the reasoning performance of foundation models by mitigating GRPO's advantage reversion and advantage mirror problems. Dynamically reweighting the advantage function according to sample characteristics allows a more refined advantage function to be constructed.
Limitations: The effectiveness of MAPO may be limited to the specific foundation models and reasoning tasks evaluated; additional experiments on a wider range of models and tasks are needed. Further research is also needed on the generalizability of the advantage percent deviation and the dynamic reweighting strategy.