This paper proposes novel classical and quantum online algorithms for finite-horizon and infinite-horizon average-reward Markov decision processes (MDPs). The proposed algorithms are based on a hybrid exploratory-generative reinforcement learning (RL) model in which agents can interact freely with the environment and, at times, draw generative samples (i.e., access a "simulator"). By using classical and quantum algorithms that approximate optimal policies in the generative model, we show that directly computing and using such (near-)optimal policies allows us to avoid several standard RL paradigms, such as "optimism in the face of uncertainty" and "posterior sampling," and to obtain better regret bounds than previous work. For finite-horizon MDPs, our quantum algorithm achieves a regret bound that depends only logarithmically on the number of time steps $T$, thereby breaking the $\sqrt{T}$ barrier faced by classical algorithms. This matches the $T$-dependence of the earlier quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), while improving the dependence on other parameters, such as the state-space size $S$ and the action-space size $A$. For infinite-horizon MDPs, our classical and quantum bounds still retain the $O(\sqrt{T})$ dependence but have better $S$ and $A$ factors. Nevertheless, we propose a novel regret metric for infinite-horizon MDPs under which quantum algorithms attain $\operatorname{poly}\log T$ regret, exponentially better than classical algorithms. Finally, we generalize all of our results to compact state spaces.
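
For concreteness, the regret notions referred to above are typically defined as follows; these are the standard textbook definitions, given here only as a sketch of the setting, and the paper's own metrics (in particular the new infinite-horizon metric) may differ. In the finite-horizon (episodic) case with horizon $H$, $K$ episodes, and $T = KH$,
\[
\operatorname{Regret}(T) \;=\; \sum_{k=1}^{K}\Bigl(V^{\star}_{1}(s^{k}_{1}) - V^{\pi_{k}}_{1}(s^{k}_{1})\Bigr),
\]
where $V^{\star}_{1}$ is the optimal value function, $\pi_{k}$ is the policy executed in episode $k$, and $s^{k}_{1}$ is its initial state. In the infinite-horizon average-reward case,
\[
\operatorname{Regret}(T) \;=\; \sum_{t=1}^{T}\bigl(\rho^{\star} - r(s_{t}, a_{t})\bigr),
\]
where $\rho^{\star}$ is the optimal long-run average reward (gain) and $r(s_{t}, a_{t})$ is the reward collected at step $t$.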