Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Created by
  • Haebom

Author

Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan

Outline

This paper introduces Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in reinforcement learning (RL) for large language models (LLMs). AsyPPO leverages lightweight mini-critics to improve learning stability and performance, and it outperforms strong existing baselines such as GRPO and PPO.
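As a rough illustration only (not the authors' implementation), the core idea of replacing a single large critic with several lightweight mini-critics whose value estimates are averaged for advantage estimation might look like the sketch below. The class name, hidden size, and the `ensemble_values` helper are hypothetical; the spread (standard deviation) across critics is kept as the "inter-critic uncertainty" signal mentioned in the takeaways.

```python
import torch
import torch.nn as nn

class MiniCritic(nn.Module):
    """A lightweight value head (hypothetical size), far smaller than the policy LLM."""
    def __init__(self, hidden_size: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_size) -> per-token value estimates
        return self.net(states).squeeze(-1)

def ensemble_values(critics, states):
    """Average the value estimates of K mini-critics (K >= 2).
    The mean feeds advantage estimation; the std across critics
    serves as an uncertainty signal for shaping policy updates."""
    values = torch.stack([c(states) for c in critics], dim=0)  # (K, batch, seq_len)
    return values.mean(dim=0), values.std(dim=0)
```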

Takeaways, Limitations

Takeaways:
Presents a new architecture that re-emphasizes the role of the critic in the LLM setting.
Improves performance while maintaining computational efficiency by using lightweight mini-critics.
Improves policy updates by leveraging inter-critic uncertainty (see the sketch after this list).
Achieves performance surpassing existing baselines across various benchmarks.
Limitations:
Training used only about 5,000 samples.
Details about the specific training settings and datasets are lacking.
Further research is needed to determine whether the results generalize to other LLMs and tasks.
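To make the inter-critic uncertainty point concrete, here is a hedged sketch of one plausible way the disagreement signal could gate a PPO-style update: masking the advantage and entropy terms on tokens where the mini-critics disagree strongly. The threshold, coefficients, and the gating direction are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def uncertainty_gated_loss(advantages, value_std, logp_new, logp_old, entropy,
                           clip_eps: float = 0.2, std_threshold: float = 0.5,
                           entropy_coef: float = 0.01):
    """PPO-style clipped objective where inter-critic disagreement (value_std)
    gates which tokens contribute to the advantage and entropy terms."""
    ratio = torch.exp(logp_new - logp_old)

    # Treat advantages as unreliable where the mini-critics disagree strongly,
    # and drop those tokens from the policy-gradient term.
    adv_mask = (value_std <= std_threshold).float()
    masked_adv = advantages * adv_mask

    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * masked_adv
    policy_loss = -torch.min(ratio * masked_adv, clipped).mean()

    # Apply the entropy bonus only on low-uncertainty tokens, to avoid
    # encouraging exploration where the value signal itself is noisy.
    entropy_bonus = (entropy * adv_mask).mean()
    return policy_loss - entropy_coef * entropy_bonus
```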