Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Created by
  • Haebom

Author

Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan

Outline

Recent RL-for-LLM (RL4LLM) methods replace an explicit critic with an average-advantage baseline, because traditional value functions are computationally expensive at LLM scale and often fail under sparse rewards and long reasoning horizons. This paper revisits that bottleneck from an architectural perspective and introduces Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient at large model scale. AsyPPO uses a set of lightweight mini-critics, each trained on a separate prompt shard; this design promotes diversity while preserving calibration, which reduces value-estimation bias. Beyond robust estimation, AsyPPO exploits inter-critic uncertainty to refine policy updates: (i) it masks advantages in states where the critics agree, since gradients there add little learning signal, and (ii) it filters high-divergence states out of the entropy regularization to suppress unnecessary exploration. After training on only 5,000 open-source samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines such as GRPO, achieving gains of more than 6% on Qwen3-4b-Base and 3% on Qwen3-8b-Base and Qwen3-14b-Base compared to classic PPO, without any additional tricks. These results highlight the importance of architectural innovation for scalable and efficient RL4LLM algorithms.
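The uncertainty-based masking described above can be illustrated with a short sketch. The snippet below is a minimal reconstruction in PyTorch, not the paper's implementation: the function name asyppo_style_masks, the percentile thresholds agree_pct and diverge_pct, and the tensor shapes are all illustrative assumptions. It only shows how per-token disagreement among mini-critics could gate (i) the advantages used in the policy loss and (ii) the tokens that receive an entropy bonus.

```python
# Minimal sketch of the two uncertainty-based signals described above.
# Thresholds, percentile choices, and shapes are illustrative assumptions,
# not the paper's exact formulation.
import torch

def asyppo_style_masks(values, advantages, agree_pct=20.0, diverge_pct=80.0):
    """values: [num_critics, T] per-token value estimates from the mini-critics.
    advantages: [T] advantage estimates for the same tokens.
    Returns advantages masked for the policy loss and a mask that drops
    high-divergence tokens from the entropy bonus."""
    # Inter-critic disagreement: std of the mini-critic values per token.
    disagreement = values.std(dim=0)  # [T]

    # (i) Where critics agree (low disagreement), the state is treated as
    # carrying little learning signal, so its advantage is masked out.
    agree_thresh = torch.quantile(disagreement, agree_pct / 100.0)
    advantage_mask = (disagreement > agree_thresh).float()

    # (ii) Where critics diverge strongly, the state is excluded from the
    # entropy regularizer to suppress unnecessary exploration there.
    diverge_thresh = torch.quantile(disagreement, diverge_pct / 100.0)
    entropy_mask = (disagreement < diverge_thresh).float()

    return advantages * advantage_mask, entropy_mask


if __name__ == "__main__":
    torch.manual_seed(0)
    values = torch.randn(4, 128)      # 4 mini-critics, 128 tokens
    advantages = torch.randn(128)
    masked_adv, ent_mask = asyppo_style_masks(values, advantages)
    # The policy loss would use masked_adv; the entropy bonus would be
    # weighted by ent_mask before averaging over tokens.
    print(masked_adv.shape, ent_mask.mean().item())
```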

Takeaways, Limitations

Takeaways:
  • AsyPPO effectively restores the role of the critic in RL for LLMs, improving learning stability and performance.
  • The mini-critic architecture reduces value-estimation bias while maintaining computational efficiency.
  • Policy updates that exploit inter-critic uncertainty improve learning efficiency.
  • Performance improvements were observed across multiple model sizes (Qwen3-4b, 8b, 14b).
Limitations:
  • The scope of the datasets and benchmarks used in the experiments may be limited.
  • Further research is needed to determine whether the improvements generalize to other LLM architectures and tasks.
  • Further analysis of the mini-critics' design and training procedure may be required.
  • Although only 5,000 samples were used here, more data may be needed in real-world applications.