Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Created by
  • Haebom

Author

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

Outline

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of the likelihood function, which is essential for the RL objective. Existing methods approximate the likelihood with the evidence lower bound (ELBO), but estimating the ELBO requires many Monte Carlo (MC) samples, incurring significant memory overhead at each training step. This paper proposes Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This bound satisfies two key properties: linearity (it is a sum of per-sample terms, so gradients can be accumulated sample by sample with constant memory) and equivalence (in on-policy training, its value and gradient match those of the ELBO-based objective). Experimental results show that BGPO outperforms previous RL algorithms for dLLMs on mathematical problem solving, code generation, and planning tasks.
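To make the linearity property concrete, below is a minimal PyTorch sketch contrasting a naive ELBO estimate (all MC samples held in one computation graph) with BGPO-style per-sample gradient accumulation. The `ToyDiffusionLM` class and its `elbo_term` method are hypothetical placeholders, not the authors' implementation; the sketch only illustrates why a sum of linear per-sample terms allows constant activation memory.

```python
import torch
import torch.nn as nn

class ToyDiffusionLM(nn.Module):
    """Toy stand-in for a diffusion LLM. `elbo_term` is a hypothetical
    placeholder for one Monte Carlo term of the ELBO at timestep t."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def elbo_term(self, x, t):
        # Toy per-timestep reconstruction term; a real dLLM would compute
        # the masked-token log-likelihood at noise level t.
        return -((self.proj(x) - x) ** 2).mean() * t

def naive_elbo_backward(model, x, num_mc=8):
    # Naive ELBO estimate: all MC samples share one computation graph,
    # so activation memory grows linearly with num_mc.
    terms = [model.elbo_term(x, torch.rand(())) for _ in range(num_mc)]
    (-torch.stack(terms).mean()).backward()

def bgpo_style_backward(model, x, advantage, num_mc=8):
    # Linearity: the lower bound is a sum of per-sample terms, so each
    # term can be backpropagated on its own and its graph freed at once,
    # keeping activation memory constant regardless of num_mc.
    for _ in range(num_mc):
        term = -advantage * model.elbo_term(x, torch.rand(())) / num_mc
        term.backward()  # gradients accumulate in .grad; graph is freed here

model = ToyDiffusionLM()
x = torch.randn(4, 16)
bgpo_style_backward(model, x, advantage=1.0)
print(model.proj.weight.grad.shape)  # gradients accumulated sample by sample
```

Because each `term.backward()` call releases that sample's graph before the next sample is drawn, the MC sample count can be increased (for a more accurate likelihood approximation) without growing peak memory.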

Takeaways, Limitations

Takeaways:
  • Proposes BGPO, a memory-efficient RL algorithm for dLLMs.
  • Resolves the memory overhead that arises when applying RL to dLLMs.
  • Achieves superior performance over existing methods on math problem solving, code generation, and planning tasks.
  • Supports large MC sample sizes, enabling more accurate likelihood approximation.
Limitations:
  • Specific limitations are not discussed in the paper (however, since the algorithm is specialized for dLLMs, its generalization to other model families may be limited).