Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs

Created by
  • Haebom

Author

Maris F. L. Galesloot, Roman Andriushchenko, Milan Češka, Sebastian Junges, Nils Jansen

Outline

This paper proposes the hidden-model POMDP (HM-POMDP) to address the vulnerability of policies to environmental changes in partially observable Markov decision processes (POMDPs), which model sequential decision-making under uncertainty. An HM-POMDP represents a set of candidate environment models (POMDPs) that share common action and observation spaces; the true environment is assumed to be one of these candidates, but which one is unknown at runtime. To compute robust policies that achieve sufficient performance in every candidate POMDP, the paper combines (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing the worst-case POMDP within the HM-POMDP, and (2) an ascent-descent method that optimizes the candidate policy against that worst-case POMDP. Experimental results show that the proposed method produces policies that are more robust and generalize better to unseen POMDPs than existing methods, and that it scales to HM-POMDPs with more than 100,000 environments.
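The core optimization can be pictured as a max-min loop: evaluate the current policy on every candidate model, select the worst-case POMDP, and take a gradient ascent step against it. The sketch below is a minimal, hypothetical illustration of that loop in Python; it uses a simplified memoryless policy, Monte Carlo rollouts in place of the paper's verification-based exact evaluation, and invented names (TinyPOMDP, evaluate_policy, robust_ascent_descent) that are not the authors' implementation.

```python
# Minimal ascent-descent sketch over an HM-POMDP: a finite set of candidate
# POMDPs sharing action and observation spaces. All names are illustrative
# assumptions, not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

class TinyPOMDP:
    """A toy two-state candidate model; the slip probability varies across
    candidates, playing the role of the hidden environment parameter."""
    def __init__(self, slip):
        self.slip = slip
    def reset(self):
        return 0
    def observe(self, state):
        return state  # fully revealing observation, for simplicity
    def step(self, state, action):
        # Action 0 tries to stay, action 1 tries to switch; slip flips the move.
        target = state if action == 0 else 1 - state
        if rng.random() < self.slip:
            target = 1 - target
        return target, (1.0 if target == 1 else 0.0)

def evaluate_policy(theta, pomdp, episodes=16, horizon=20):
    """Monte Carlo estimate of the return of a softmax memoryless policy,
    standing in for the paper's verification-based exact evaluation."""
    total = 0.0
    for _ in range(episodes):
        state, ret = pomdp.reset(), 0.0
        for _ in range(horizon):
            logits = theta[pomdp.observe(state)]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            action = rng.choice(len(probs), p=probs)
            state, reward = pomdp.step(state, action)
            ret += reward
        total += ret
    return total / episodes

def robust_ascent_descent(pomdps, n_obs, n_act, iters=100, lr=0.1, eps=0.05):
    """Alternate picking the worst-case model for the current policy (descent)
    with a crude finite-difference gradient step on that model (ascent)."""
    theta = np.zeros((n_obs, n_act))
    for _ in range(iters):
        values = [evaluate_policy(theta, m) for m in pomdps]
        worst = pomdps[int(np.argmin(values))]   # worst-case POMDP
        base = evaluate_policy(theta, worst)
        grad = np.zeros_like(theta)
        for idx in np.ndindex(theta.shape):
            bumped = theta.copy()
            bumped[idx] += eps
            grad[idx] = (evaluate_policy(bumped, worst) - base) / eps
        theta += lr * grad                       # ascend on the worst case
    return theta

# Three candidate models; the policy must perform well on all of them.
models = [TinyPOMDP(slip=s) for s in (0.0, 0.1, 0.3)]
theta = robust_ascent_descent(models, n_obs=2, n_act=2)
```

In the paper, the worst-case selection is performed by formal verification rather than sampling, which makes the evaluation exact; the finite-difference step above merely stands in for a proper policy-gradient update over finite-memory policies.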

Takeaways, Limitations

Takeaways:
  • The HM-POMDP framework offers a principled way to efficiently learn policies that are robust to environmental changes.
  • Combining deductive formal verification with the ascent-descent method enables robust policy generation for large-scale HM-POMDPs.
  • The proposed method produces policies that are more robust and generalize better than existing methods.
Limitations:
  • Performance may depend on the choice of the worst-case POMDP; further research may be needed to find the worst-case POMDP efficiently.
  • Scalability to substantially more complex HM-POMDPs requires further validation.
  • Applicability and generalization performance in real-world settings need further study.