This paper points out that in the reinforcement learning stage of large language model (LLM) training, especially on reasoning tasks such as mathematical problems, algorithms like PPO and VinePPO spend much of their compute inefficiently: many training problems are either solved on every attempt (so they are already learned) or on none (so they provide no meaningful training signal). To address this, we propose a curriculum that preferentially trains on problems with high variance in success rate, i.e., problems the model solves sometimes but not always, by adapting the 'sampling for learnability' technique from the reinforcement learning literature to the RL stage of LLM training. Experimental results show that this curriculum consistently improves performance across multiple algorithms and datasets.
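To make the idea concrete, the sketch below shows one way such a learnability-based curriculum could be implemented: estimate each problem's empirical success rate from a few rollouts, then sample the training batch in proportion to p(1 - p), the variance of the Bernoulli success indicator. This is a minimal illustration, not the paper's code; `policy.solve`, the rollout count, and the sampling helper are all hypothetical placeholders.

```python
import numpy as np

def estimate_success_rates(problems, policy, n_rollouts=8):
    """Estimate each problem's empirical success rate p from a few rollouts.
    `policy.solve(problem)` is a placeholder that returns 1 on a correct
    solution and 0 otherwise."""
    rates = []
    for problem in problems:
        successes = sum(policy.solve(problem) for _ in range(n_rollouts))
        rates.append(successes / n_rollouts)
    return np.array(rates)

def sample_learnable_batch(problems, success_rates, batch_size, rng=None):
    """Sample problems with probability proportional to p * (1 - p), so
    problems that are solved sometimes but not always (highest success-rate
    variance) are trained on most often; always-solved and never-solved
    problems get zero weight."""
    rng = rng or np.random.default_rng()
    weights = success_rates * (1.0 - success_rates)
    if weights.sum() == 0:  # degenerate case: every problem is at p=0 or p=1
        weights = np.ones_like(weights)
    probs = weights / weights.sum()
    idx = rng.choice(len(problems), size=batch_size, replace=False, p=probs)
    return [problems[i] for i in idx]
```

In this sketch the weighting p(1 - p) peaks at p = 0.5, so the curriculum concentrates updates on problems near the edge of the model's current ability, which is the intuition behind prioritizing high success-rate variance.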