Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of language models through reinforcement learning, but existing approaches rely either on the model's own self-exploration or on a single offline tutor, and we aim to overcome the limitations of both. In this paper, we propose the Adaptive Multi-Guidance Policy Optimization (AMPO) framework, which introduces an "on-demand guidance" approach: a student model receives guidance from multiple capable tutor models only when it fails to generate a correct answer on its own. By broadening exploration while preserving the value of self-discovery, and by encouraging the student to learn from reasoning paths it is most likely to comprehend, AMPO strikes a balance between extensive exploration and effective exploitation.
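The following is a minimal, hypothetical sketch of the on-demand selection logic described above, not the paper's actual implementation: correct self-generated paths are kept when they exist, and tutor paths are used only as a fallback, preferring the one the student is most likely to understand. The function names, the `is_correct` verifier, and the `student_likelihood` comprehension proxy are all assumptions introduced for illustration.

```python
from typing import Callable, List


def select_training_paths(
    student_paths: List[str],
    tutor_paths: List[str],
    is_correct: Callable[[str], bool],
    student_likelihood: Callable[[str], float],
) -> List[str]:
    """Hypothetical sketch of on-demand guidance selection (not the paper's API).

    `is_correct` stands in for the verifiable reward, and `student_likelihood`
    for a comprehension score of a path under the student model.
    """
    # Keep self-discovered correct paths; no tutor guidance is requested.
    self_solved = [p for p in student_paths if is_correct(p)]
    if self_solved:
        return self_solved

    # Guidance is used only on failure: fall back to correct tutor paths,
    # preferring the path the student is most likely to understand.
    correct_tutor = [p for p in tutor_paths if is_correct(p)]
    if not correct_tutor:
        return []
    return [max(correct_tutor, key=student_likelihood)]


# Toy usage with string answers, a trivial verifier, and a crude length-based
# proxy for how easy a path is for the student to follow.
paths = select_training_paths(
    student_paths=["answer: 41", "answer: 40"],
    tutor_paths=["answer: 42 (short derivation)", "answer: 42 (long derivation)"],
    is_correct=lambda p: "42" in p,
    student_likelihood=lambda p: -len(p),
)
print(paths)  # -> ['answer: 42 (short derivation)']
```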