Daily Arxiv

This page collects and organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Scalable Policy-Based RL Algorithms for POMDPs

Created by
  • Haebom

Author

Ameya Anjarlekar, Rasoul Etesami, R. Srikant

Outline

The continuous nature of a POMDP's belief states poses significant computational challenges for optimal policy learning. In this paper, we consider an approach to the partially observable reinforcement learning (PORL) problem that approximates the POMDP by a finite-state MDP (the Superstate MDP). We present theoretical guarantees linking the optimal value function of the Superstate MDP to the optimal value function of the original POMDP, improving on existing results. We then propose a policy-based learning approach that uses linear function approximation to learn optimal policies for the Superstate MDP. The approach treats the POMDP as an MDP whose states correspond to finite histories, and shows that it can be approximately solved using TD learning and policy optimization, with an approximation error that decays exponentially in the history length. We also present a finite-time bound that explicitly quantifies the error incurred when applying standard TD learning in environments whose true dynamics are non-Markovian.
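To make the construction concrete, here is a minimal Python sketch of the two ingredients described above: collapsing the POMDP into a finite-state "superstate" MDP whose states are the last m (action, observation) pairs, and running TD(0) with linear function approximation over those superstates. The window length m, the one-hot feature map, the environment interface (reset/step), and the fixed evaluation policy are illustrative assumptions, not the paper's notation; the error bounds mentioned above apply to the paper's algorithm, not to this sketch.

```python
# Hedged sketch, assuming a hypothetical env with reset() -> obs and
# step(a) -> (obs, reward, done), and a fixed policy mapping superstates
# to actions. Not the paper's implementation.
from collections import deque
import numpy as np

GAMMA = 0.95   # discount factor
ALPHA = 0.05   # TD step size


def superstate(history, m):
    """The superstate is simply the last m (action, observation) pairs."""
    return tuple(history)[-m:]


def phi(state, m, n_act, n_obs):
    """One-hot feature vector of dimension m * n_act * n_obs:
    one slot per window position, per (action, observation) pair."""
    x = np.zeros(m * n_act * n_obs)
    for i, (a, o) in enumerate(state):
        x[i * n_act * n_obs + a * n_obs + o] = 1.0
    return x


def td0_episode(env, policy, w, m, n_act, n_obs):
    """One episode of TD(0) on the superstate MDP under a fixed policy."""
    history = deque(maxlen=m)                  # truncated history window
    history.append((0, env.reset()))           # dummy initial action
    done = False
    while not done:
        s = superstate(history, m)
        a = policy(s)
        obs, r, done = env.step(a)
        history.append((a, obs))
        x = phi(s, m, n_act, n_obs)
        x_next = phi(superstate(history, m), m, n_act, n_obs)
        target = r + (0.0 if done else GAMMA * w @ x_next)
        w += ALPHA * (target - w @ x) * x      # TD(0) update on linear weights
    return w
```

Increasing the window length m enlarges the feature dimension but, per the summary, shrinks the approximation error exponentially, which is the trade-off the paper's bounds quantify.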

Takeaways, Limitations

Takeaways:
A novel approach to the PORL problem that approximates the POMDP by a finite-state MDP.
Theoretical guarantees relating the optimal value function of the Superstate MDP to that of the original POMDP, improving on previous studies.
A method for approximately solving the POMDP using TD learning and policy optimization (see the sketch after this list).
A proof that the approximation error decreases exponentially with the history length.
A finite-time bound that explicitly quantifies the error incurred when applying TD learning in a non-Markovian environment.
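As a companion to the TD sketch above, here is a minimal illustration of the policy-optimization half under the same assumptions: a softmax policy that is linear in the superstate features, updated with a plain REINFORCE-style gradient. The parameterization, step size, and update rule are assumptions made for illustration; the paper's policy-optimization procedure may differ.

```python
# Hedged sketch of policy optimization over superstate features; the softmax
# parameterization and REINFORCE update are illustrative, not the paper's.
import numpy as np


def softmax_policy(theta, x):
    """theta: (n_actions, feature_dim) weights; x: superstate feature vector."""
    logits = theta @ x
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def reinforce_update(theta, trajectory, gamma=0.95, lr=0.01):
    """trajectory: list of (x, a, r) tuples collected with the current policy,
    where x is the superstate feature vector from the TD sketch above."""
    G = 0.0
    for x, a, r in reversed(trajectory):   # discounted return, computed backwards
        G = r + gamma * G
        p = softmax_policy(theta, x)
        grad_log = -np.outer(p, x)         # gradient of log pi(a|x) w.r.t. theta
        grad_log[a] += x
        theta += lr * G * grad_log         # REINFORCE ascent step
    return theta
```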
Limitations:
Further experimental validation of the policy-based learning approach is needed.
Scalability needs to be evaluated on real, complex PORL problems.
No comparative analysis with other approximation techniques is provided.