This paper proposes a novel reinforcement learning (RL) formulation for training a continuous-time score-based diffusion model for generative AI. The formulation trains the model to generate samples that maximize a reward function while keeping the generated distribution close to the unknown target data distribution. Unlike previous studies, we neither learn a score function nor rely on a pre-trained model for the score function of the unknown, noise-perturbed data distributions. Instead, we formulate the problem as an entropy-regularized continuous-time RL problem and show that the optimal stochastic policy is Gaussian with a known covariance matrix. Based on this result, we parameterize the mean of the Gaussian policy and develop an actor-critic type (little) q-learning algorithm to solve the RL problem. A key ingredient of the algorithm design is to obtain noisy observations of the unknown score function via a ratio estimator. The formulation also applies to pure score matching and to fine-tuning pre-trained models. Numerically, we demonstrate the effectiveness of our approach by comparing its performance with that of two state-of-the-art RL methods for fine-tuning pre-trained models on several generative tasks, including high-dimensional image generation. Finally, we discuss the probability flow ODE implementation of the diffusion model and the extension of the RL formulation to conditional diffusion models.
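To illustrate the kind of objective involved, the following is a schematic sketch in generic notation (not the paper's exact formulation): an entropy-regularized continuous-time RL problem chooses a stochastic policy $\pi$ to maximize a reward augmented by an entropy bonus along the policy-controlled (reverse-time) dynamics,
\[
\max_{\pi}\;\mathbb{E}\!\left[\int_0^T \Big(r\big(t, X_t^{\pi}, a_t\big) + \theta\,\mathcal{H}\big(\pi(\cdot \mid t, X_t^{\pi})\big)\Big)\,\mathrm{d}t \;+\; h\big(X_T^{\pi}\big)\right],
\]
where $r$ and $h$ denote running and terminal rewards, $\theta>0$ is a temperature parameter, $\mathcal{H}$ is differential entropy, and $X^{\pi}$ follows the controlled diffusion; all symbols here are illustrative placeholders rather than the paper's definitions. As stated above, the optimal policy in the paper's formulation is Gaussian with a known covariance matrix, so only its mean needs to be parameterized and learned.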