SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Created by: Haebom
Authors: Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Outline
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models because they can decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging: standard policy gradient methods cannot be applied directly because the log-likelihood is intractable. Previous studies have relied on surrogates such as the evidence lower bound (ELBO), but these one-sided approximations can introduce significant bias into the policy gradient. To address this, the study proposes Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound on the true log-likelihood. Experimental results show that SPG significantly outperforms baselines based on the ELBO or single-step estimation. Specifically, SPG improves over state-of-the-art dLLM RL methods by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
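The sketch below illustrates one way such a sandwiched surrogate could be assembled, assuming per-response lower-bound (ELBO-style) and upper-bound estimates of the log-likelihood are already available. The function name, the advantage-sign split, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sandwiched_pg_loss(elbo_lower: torch.Tensor,
                       upper_bound: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Hypothetical sandwiched policy-gradient surrogate (illustration only).

    elbo_lower : (B,) lower bound on log pi(y|x) for each sampled response
    upper_bound: (B,) upper bound on log pi(y|x) for each sampled response
    advantages : (B,) reward-derived advantage for each response
    """
    # For responses to reinforce (advantage > 0), raising a lower bound also
    # raises the intractable true log-likelihood; for responses to suppress
    # (advantage < 0), lowering an upper bound also lowers it. Weighting each
    # side by the advantage keeps the surrogate on the conservative side of
    # the true objective, which is the intuition behind "sandwiching".
    reinforce_term = advantages.clamp(min=0.0) * elbo_lower
    suppress_term = advantages.clamp(max=0.0) * upper_bound
    # Maximize the advantage-weighted surrogate => minimize its negation.
    return -(reinforce_term + suppress_term).mean()
```

In an actual training loop, both bounds would themselves be stochastic estimates computed from the masked-diffusion model; the paper's contribution lies in how these bounds are constructed and combined so that the resulting gradient bias stays controlled.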
Takeaways, Limitations
•
Takeaways:
◦
SPG is proposed as a novel method for reducing policy gradient bias in reinforcement learning for dLLMs.
◦
Significantly improves performance over existing ELBO-based methods.
◦
Achieves state-of-the-art results in a variety of mathematical and reasoning tasks.
•
Limitations:
◦
Specific limitations are not explicitly discussed in the paper (although, given the nature of dLLMs and the need to estimate both bounds, the computational cost may be high).