Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning
Created by
Haebom
Authors
Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Outline
This paper focuses on actor-critic methods for reinforcement learning in continuous action spaces. Existing algorithms for continuous action spaces use the Bellman operator to model the Q-value of the current policy rather than the optimal value function, which limits sample efficiency. This study investigates the effectiveness of integrating the Bellman optimality operator into the actor-critic framework. Experiments in a simple environment show that modeling optimality accelerates learning but introduces overestimation bias. To address this, we propose an annealing technique that gradually transitions from the Bellman optimality operator to the Bellman operator. Combined with TD3 and SAC, our method outperforms existing methods on a variety of locomotion and manipulation tasks and is robust to optimality-related hyperparameters. The code is available at https://github.com/motokiomura/annealed-q-learning.
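The core idea can be sketched as a TD target that interpolates between the two operators. The snippet below is a minimal illustration, not the authors' implementation: it assumes the optimality-operator backup is approximated by a max over several actions sampled from the current policy and that the mixing weight decays linearly; the function name `annealed_td_target` and parameters such as `num_action_samples` and `total_steps` are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code): a TD target that anneals
# from a Bellman-optimality-style backup (max over sampled actions) to the
# standard Bellman backup (single action sampled from the policy).
import torch

def annealed_td_target(reward, done, next_obs, policy, target_q, gamma,
                       step, total_steps, num_action_samples=10):
    """Return r + gamma * [(1 - w) * Q(s', a'~pi) + w * max_i Q(s', a_i~pi)],
    where w anneals linearly from 1 (optimality operator) to 0 (Bellman operator).
    `policy(obs)` is assumed to return sampled actions; `target_q(obs, act)` a Q estimate."""
    with torch.no_grad():
        # Standard Bellman backup: one action sampled from the current policy.
        a_next = policy(next_obs)                        # (B, action_dim)
        q_pi = target_q(next_obs, a_next)                # (B, 1)

        # Approximate optimality backup: max over several sampled actions.
        batch_size = next_obs.shape[0]
        obs_rep = next_obs.repeat_interleave(num_action_samples, dim=0)
        a_samples = policy(obs_rep)                      # (B*N, action_dim)
        q_samples = target_q(obs_rep, a_samples).view(batch_size, num_action_samples)
        q_max = q_samples.max(dim=1, keepdim=True).values

        # Annealing weight: starts at 1 (optimistic, fast) and decays to 0.
        w = max(0.0, 1.0 - step / total_steps)
        q_next = (1.0 - w) * q_pi + w * q_max

        return reward + gamma * (1.0 - done) * q_next
```

In a TD3- or SAC-style update, a target of this form would replace the usual next-state value in the critic loss; the annealing schedule (here linear over `total_steps`) is one of the optimality-related hyperparameters to which the paper reports robustness.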
•
Takeaways: An annealing technique based on the Bellman optimality operator improves sample efficiency in continuous action space reinforcement learning and boosts the performance of existing algorithms such as TD3 and SAC, while also improving robustness to optimality-related hyperparameters.
•
Limitations: The analysis motivating the method relies on experiments in a simple environment, so additional experiments in more complex and diverse environments are needed. Further analysis is also required to determine whether the annealing technique fully resolves the overestimation bias introduced by the Bellman optimality operator.