Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

Created by
  • Haebom

Authors

Xiaoyang Yuan, Yujuan Ding, Yi Bin, Wenqi Shao, Jinyu Cai, Jingkuan Song, Yang Yang, Heng Tao Shen

Outline

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of language models via reinforcement learning, but existing approaches rely either on pure self-exploration or on a single off-policy teacher. This paper proposes the Adaptive Multi-Guidance Policy Optimization (AMPO) framework, which introduces a "demand-based guidance" approach: the student model receives guidance from multiple proficient teacher models only when it fails to generate a correct answer on its own. By broadening exploration while preserving the value of self-discovery, and by encouraging the student to learn from the reasoning paths it is most likely to comprehend, AMPO balances broad exploration with effective exploitation.
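The demand-based guidance loop described above can be sketched in a few lines. The code below is a minimal illustration under assumptions, not the authors' implementation: the helper names (`student_sample`, `teacher_traces`, `is_correct`, `student_nll`) are hypothetical, and using the student's negative log-likelihood as a proxy for how "comprehensible" a teacher trace is reflects one plausible reading of the abstract.

```python
from typing import Callable, List

def ampo_batch(
    prompts: List[str],
    student_sample: Callable[[str, int], List[str]],  # draws n rollouts from the student
    teacher_traces: Callable[[str], List[str]],       # one reasoning trace per teacher
    is_correct: Callable[[str, str], bool],           # verifiable reward (e.g., answer check)
    student_nll: Callable[[str, str], float],         # student's avg. negative log-likelihood of a trace
    n_rollouts: int = 8,
) -> List[dict]:
    """Build one training batch with demand-based guidance (illustrative sketch).

    The student explores on its own first; teacher guidance is injected
    only when every rollout fails, and the trace the student finds easiest
    (lowest NLL, i.e. most "comprehensible") is the one kept.
    """
    batch = []
    for prompt in prompts:
        rollouts = student_sample(prompt, n_rollouts)
        if any(is_correct(prompt, r) for r in rollouts):
            # Self-discovery succeeded: train only on the student's own rollouts.
            batch.append({"prompt": prompt, "responses": rollouts, "guided": False})
        else:
            # All rollouts failed: query the teacher pool and keep the trace
            # the student is most likely to comprehend.
            traces = teacher_traces(prompt)
            best = min(traces, key=lambda t: student_nll(prompt, t))
            batch.append({"prompt": prompt, "responses": rollouts + [best], "guided": True})
    return batch
```

The key design choice this sketch captures is that teacher traces never displace successful self-discovery; they enter the training batch only on failure.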

Takeaways, Limitations

  • AMPO outperforms a strong baseline (GRPO), achieving a 4.3% improvement on mathematical reasoning tasks and a 12.2% improvement on out-of-distribution tasks.
  • Pass@k performance improves significantly, and exploration becomes more diverse (see the Pass@k sketch after this list).
  • AMPO with four equally sized teacher models achieves results comparable to approaches that use a single, more powerful teacher model (e.g., DeepSeek-R1).
  • The proposed method offers a more efficient and scalable path to stronger reasoning and generalization.
  • Limitations are not specifically discussed in the abstract.
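For reference, Pass@k measures the probability that at least one of k sampled solutions is correct, and is commonly computed with the unbiased estimator of Chen et al. (2021): given n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch with illustrative numbers (not values from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn per problem, c: number of correct samples.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values only: with 3 correct out of 16 samples,
# the chance that at least one of 8 draws is correct is 0.9.
print(pass_at_k(n=16, c=3, k=8))
```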