A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the computational intractability of their likelihood function, which is essential for computing RL objectives. Existing methods approximate the likelihood with the evidence lower bound (ELBO), but this incurs significant memory overhead at each training step. This paper proposes Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm. BGPO maximizes a specially constructed lower bound on the ELBO-based objective, and this bound satisfies two key properties: linearity and equivalence. Experimental results show that BGPO outperforms previous RL algorithms on mathematical problem solving, code generation, and planning tasks.
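As a rough illustration of why a linear (per-sample additive) objective helps with memory, consider the sketch below. It is not the paper's implementation; it only shows the generic mechanism that such linearity enables: when a surrogate loss is a plain sum of per-sample terms, each Monte Carlo sample (or small micro-batch) can be back-propagated separately and the gradients accumulate in place, so the total gradient matches a single large backward pass while peak activation memory stays bounded by the micro-batch size. All names here (`toy_model`, `per_sample_loss`, `micro_batch`) are illustrative placeholders, not symbols from the paper.

```python
import torch

torch.manual_seed(0)

toy_model = torch.nn.Linear(8, 1)   # stand-in for a policy network
samples = torch.randn(64, 8)        # 64 Monte Carlo samples (illustrative)
weights = torch.randn(64)           # per-sample weights, e.g. advantage-like terms


def per_sample_loss(model, x, w):
    # One scalar term per sample; the full objective is simply their sum.
    return w * model(x).squeeze(-1)


# Reference: one large backward pass over all samples (peak memory grows with sample count).
toy_model.zero_grad()
per_sample_loss(toy_model, samples, weights).sum().backward()
full_grad = toy_model.weight.grad.clone()

# Accumulated: process a few samples at a time (peak memory bounded by micro-batch size).
toy_model.zero_grad()
micro_batch = 8
for start in range(0, samples.shape[0], micro_batch):
    chunk_loss = per_sample_loss(
        toy_model,
        samples[start:start + micro_batch],
        weights[start:start + micro_batch],
    ).sum()
    chunk_loss.backward()  # gradients add up in the .grad buffers

accumulated_grad = toy_model.weight.grad

# Both strategies yield the same total gradient (up to floating-point summation order).
print(torch.allclose(full_grad, accumulated_grad, atol=1e-6))
```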