In this paper, we propose ReMix, an off-policy reinforcement learning method that addresses the computational inefficiency of existing on-policy approaches to the __T83357__ of reinforcement learning (RL) for improving the reasoning ability of large language models (LLMs). ReMix enables on-policy reinforcement fine-tuning (RFT) methods such as PPO and GRPO to leverage off-policy data, and consists of three main components: mixed-policy proximal policy gradient, a KL-convex policy constraint, and policy reincarnation. Experimental results show that ReMix achieves state-of-the-art performance on various mathematical reasoning benchmarks while reducing training cost by 30x to 450x compared to existing methods. In addition, we present insightful analyses, including the implicit preference for shorter responses caused by the whipping effect of off-policy mismatch, and the collapse of self-reflective behavior under severe off-policyness.
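As a rough illustration of the first two components (a sketch for intuition, not the paper's exact formulation), one plausible form of a mixed-policy proximal policy gradient objective with a KL-convex constraint is

\[
\mathcal{J}(\theta)
= \hat{\mathbb{E}}_{(s,a)\sim \mathcal{D}_{\text{mix}}}
\Big[\min\big(r_\theta(s,a)\,\hat{A}(s,a),\ \mathrm{clip}\big(r_\theta(s,a),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}(s,a)\big)\Big]
- \beta\Big(\lambda\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\text{base}}\big) + (1-\lambda)\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\theta_{\text{old}}}\big)\Big),
\]

where \(r_\theta(s,a) = \pi_\theta(a\mid s)/\pi_\beta(a\mid s)\) and \(\pi_\beta\) denotes the behavior policy that generated each sample: the current policy for on-policy rollouts and an earlier checkpoint for off-policy (replayed) rollouts, both drawn from the mixed batch \(\mathcal{D}_{\text{mix}}\). The convex weight \(\lambda \in [0,1]\) interpolates the KL constraint between the base (reference) policy and the most recent policy. The mixing ratio of \(\mathcal{D}_{\text{mix}}\), the coefficients \(\beta\) and \(\lambda\), and the choice of reference policies are assumptions made here for illustration only.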