Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

Created by
  • Haebom

Authors

Andris Ambainis, Joao F. Doriguello, Debbie Lim

Outline

This paper proposes novel classical and quantum online algorithms for finite-horizon and infinite-horizon average-reward Markov decision processes (MDPs). The algorithms are based on a hybrid exploratory-generative reinforcement learning (RL) model in which the agent can freely interact with the environment and occasionally obtain generative samples, i.e., access a "simulator". By using classical and quantum procedures to approximate optimal policies under the generative model, the algorithms directly compute and deploy (near-)optimal policies, sidestepping standard RL paradigms such as "optimism in the face of uncertainty" and "posterior sampling" while obtaining better regret bounds than previous work.

For finite-horizon MDPs, the quantum algorithm attains a regret bound that depends only logarithmically on the number of time steps T, breaking the classical $O(\sqrt{T})$ barrier. This matches the T-dependence of the prior quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), while improving the dependence on other parameters such as the state-space size S and the action-space size A. For infinite-horizon MDPs, the classical and quantum bounds still scale as $O(\sqrt{T})$, but with better S and A factors. Under a newly proposed notion of regret for infinite-horizon MDPs, however, the quantum algorithm achieves exponentially better $\operatorname{poly}\log{T}$ regret than its classical counterpart. Finally, all results are generalized to compact state spaces.
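To make the hybrid access model concrete, below is a minimal Python sketch of the classical tabular version of this idea: draw generative samples $s' \sim P(\cdot \mid s, a)$ from the simulator, estimate the transition model, compute a near-optimal policy by value iteration, and only then interact online (where regret accrues). All names, the uniform per-state-action sampling schedule, and the `simulator`/`env` interfaces are illustrative assumptions, not the paper's actual classical or quantum algorithm.

```python
import numpy as np

def value_iteration(P_hat, R, H):
    """Finite-horizon value iteration on an estimated tabular model.

    P_hat: (S, A, S) estimated transition probabilities
    R:     (S, A) rewards in [0, 1]
    H:     horizon
    Returns an (H, S) array of greedy actions (a nonstationary policy).
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)                  # V_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P_hat @ V            # Q_h(s, a) = r(s, a) + E_{s'}[V_{h+1}(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

def hybrid_agent(simulator, env, S, A, H, n_gen, n_episodes):
    """Hypothetical hybrid loop: generative phase, then online execution.

    simulator(s, a) -> (s_next, r): one generative sample from P(.|s, a)
    env.reset() -> s, env.step(a) -> (s_next, r): online interaction
    (both interfaces are assumptions made for this sketch).
    """
    counts = np.zeros((S, A, S))
    R = np.zeros((S, A))
    # Generative phase: n_gen >= 1 simulator queries per (s, a) pair.
    for s in range(S):
        for a in range(A):
            for _ in range(n_gen):
                s_next, r = simulator(s, a)
                counts[s, a, s_next] += 1
                R[s, a] = r          # assumes deterministic rewards
    P_hat = counts / counts.sum(axis=2, keepdims=True)
    pi = value_iteration(P_hat, R, H)
    # Online phase: run the precomputed policy; regret accrues only here.
    total_reward = 0.0
    for _ in range(n_episodes):
        s = env.reset()
        for h in range(H):
            s, r = env.step(pi[h, s])
            total_reward += r
    return total_reward
```

The design point is that generative sampling lets the agent compute a good policy before paying for exploration online, which is what allows algorithms of this kind to bypass optimism and posterior sampling.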

Takeaways, Limitations

Takeaways:
A quantum algorithm that overcomes the classical $O(\sqrt{T})$ barrier for finite-horizon MDPs.
By directly computing and deploying (near-)optimal policies, the algorithms avoid standard RL paradigms (optimism in the face of uncertainty, posterior sampling) and achieve improved regret bounds.
A new regret metric for infinite-horizon MDPs under which quantum algorithms achieve $\operatorname{poly}\log{T}$ regret (the standard regret notions it is contrasted with are sketched after this list).
All results generalize to compact state spaces.
Limitations:
The algorithms assume access to a generative model (simulator); further research is needed on their applicability in real-world environments where no simulator is available.
Further research is also needed on the practical implementation and empirical evaluation of the quantum algorithms.
Optimality may not be guaranteed for every specific problem setting. (Implicit Limitations)
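For context, the bounds above are stated against standard regret notions; the following is a sketch of these common textbook definitions in generic notation, assuming an episodic setting with K episodes of horizon H (so T = KH) and an average-reward setting with optimal gain $\rho^*$. The paper's new infinite-horizon metric is a different quantity, defined in the paper itself.

```latex
% Common (textbook) regret definitions, for orientation only.
% Finite-horizon episodic regret over K episodes of horizon H, T = KH,
% where \pi_k is the policy executed in episode k:
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Bigr),
  \qquad T = KH.
\]
% Infinite-horizon average-reward regret against the optimal gain \rho^*:
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \bigl( \rho^{*} - r(s_t, a_t) \bigr).
\]
```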