Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Reward-Directed Score-Based Diffusion Models via q-Learning

Created by
  • Haebom

Author

Xuefeng Gao, Jiale Zha, Xun Yu Zhou

Outline

This paper proposes a novel reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI. The formulation generates samples that maximize a reward function while keeping the generated distribution close to the unknown target data distribution. Unlike previous studies, we do not attempt to learn the score function of the unknown noised data distribution or rely on a pre-trained score model. Instead, we formulate the problem as entropy-regularized continuous-time RL and show that the optimal stochastic policy is Gaussian with a known covariance matrix. Based on this result, we parameterize the mean of the Gaussian policy and develop an actor-critic type (little) q-learning algorithm to solve the RL problem. A key element of the algorithm design is obtaining noisy observations of the unknown score function via a ratio estimator. The formulation can also be applied to pure score matching and to fine-tuning pre-trained models. Numerically, we demonstrate the effectiveness of our approach by comparing its performance with two state-of-the-art RL methods for fine-tuning pre-trained models on several generative tasks, including high-dimensional image generation. Finally, we discuss a probability flow ODE implementation of the diffusion model and an extension of the RL formulation to conditional diffusion models.
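The sketch below is a rough illustration of the ingredients described above: a Gaussian policy whose mean is parameterized by a neural network (with a fixed, known covariance), a critic approximating a continuous-time q-function, and a stand-in for the ratio estimator that turns a batch of data samples into noisy observations of the unknown score. It is not the paper's actor-critic little q-learning algorithm; the names (PolicyMean, QNet, noisy_score_estimate), the toy data, the reward, the dynamics, and the surrogate losses are all assumptions made for illustration.

```python
# Illustrative sketch only (assumes PyTorch). All names, dimensions, the toy data,
# the reward, and the simplified losses are assumptions; this is not the paper's
# actor-critic little q-learning algorithm.
import torch
import torch.nn as nn

dim, n_steps, dt, gamma = 2, 50, 0.02, 0.1   # state dim, time grid, step size, entropy temperature

class PolicyMean(nn.Module):
    """Mean of the Gaussian policy pi_theta(a | t, x); the covariance is fixed and known."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

class QNet(nn.Module):
    """Critic approximating a continuous-time q-function q_psi(t, x, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, t, x, a):
        return self.net(torch.cat([t, x, a], dim=-1)).squeeze(-1)

def noisy_score_estimate(x, data, sigma=0.5):
    """Stand-in for the ratio estimator: noisy observations of the score of a
    Gaussian-smoothed empirical data distribution (weighted average of (y - x) / sigma^2)."""
    diffs = data.unsqueeze(0) - x.unsqueeze(1)                  # (B, N, dim)
    w = torch.softmax(-(diffs ** 2).sum(-1) / (2 * sigma ** 2), dim=1)
    return (w.unsqueeze(-1) * diffs).sum(1) / sigma ** 2        # (B, dim)

policy, critic = PolicyMean(), QNet()
opt = torch.optim.Adam([*policy.parameters(), *critic.parameters()], lr=1e-3)
data = torch.randn(512, dim) + 3.0                              # toy "unknown" target data
reward = lambda x: -((x - 3.0) ** 2).sum(-1)                    # toy terminal reward

for it in range(200):
    x = torch.randn(64, dim)                                    # start from the Gaussian prior
    q_path = torch.zeros(64)
    for k in range(n_steps):
        t = torch.full((64, 1), k * dt)
        a = policy(t, x) + gamma ** 0.5 * torch.randn(64, dim)  # sample the Gaussian policy
        q_path = q_path + critic(t, x, a) * dt                  # accumulate q along the path
        # Euler-Maruyama step of a controlled (reverse-time) dynamic, heavily simplified
        x = (x + a * dt + (2 * dt) ** 0.5 * torch.randn_like(x)).detach()
    t_end = torch.full((64, 1), n_steps * dt)
    # Crude surrogate losses: (i) the critic should account for the terminal reward,
    # (ii) the policy mean should track the noisy score observations near the data.
    critic_loss = ((q_path - reward(x)) ** 2).mean()
    actor_loss = ((policy(t_end, x) - noisy_score_estimate(x, data)) ** 2).mean()
    loss = critic_loss + actor_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```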

Takeaways, Limitations

Takeaways:
A novel RL formulation for training continuous-time score-based diffusion models without relying on a pre-trained score model.
Development of an efficient algorithm that exploits the Gaussian form of the optimal stochastic policy (with known covariance).
Effective learning via noisy observations of the unknown score function obtained with a ratio estimator.
It can also be applied to pure score matching and fine-tuning of pre-trained models.
Effectiveness demonstrated against two state-of-the-art RL fine-tuning methods on several generation tasks, including high-dimensional image generation.
Discussion of a probability flow ODE implementation and an extension to conditional diffusion models.
Limitations:
Further experiments and analysis are needed to determine the generalization performance of the proposed method.
Further research is needed on scalability and computational costs for high-dimensional data.
Performance may be affected by the accuracy of the ratio estimator.
Possible performance limitations when no information at all is available about the unknown score function.