Daily Arxiv

This page collects and organizes artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, please cite the source.

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Created by
  • Haebom

Author

Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, Mihailo R. Jovanović

Outline

This paper studies sequential decision-making in constrained Markov decision processes (MDPs), where the goal is to maximize the expected total reward subject to constraints on the expected total utility. To solve this infinite-horizon discounted optimal control problem, the authors propose the Natural Policy Gradient Primal-Dual (NPG-PD) method, which updates the primal variable via natural policy gradient ascent and the dual variable via projected subgradient descent. Although the underlying maximization problem has a nonconcave objective and a nonconvex constraint set, the method is shown to converge globally at a sublinear rate under softmax policy parameterization, with rates that are independent of the size of the state-action space. For log-linear and general smooth policy parameterizations, sublinear convergence rates are established up to a function approximation error caused by the restricted policy class. In addition, the paper provides convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms and demonstrates the effectiveness of the approach through computational experiments.
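
For intuition, below is a minimal tabular sketch of the primal-dual iteration described above, assuming exact policy evaluation on a small finite MDP. The array names (P, R, U), the constraint threshold b, the initial distribution rho, the step size eta, and the dual cap lam_max are illustrative placeholders rather than notation taken from the paper.

```python
import numpy as np

def evaluate(P, R, pi, gamma):
    """Exact policy evaluation on a tabular MDP: returns Q^pi (shape S x A)."""
    S, A = R.shape
    P_pi = np.einsum('sa,saz->sz', pi, P)          # transition matrix under pi
    r_pi = np.sum(pi * R, axis=1)                  # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('saz,z->sa', P, V)

def npg_pd(P, R, U, b, rho, gamma=0.9, eta=0.1, iters=200, lam_max=10.0):
    """Sketch of an NPG-PD loop: natural policy gradient ascent on the primal
    (softmax policy), projected sub-gradient descent on the dual variable."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)                  # uniform initial policy
    lam = 0.0                                      # dual variable for V_u >= b
    for _ in range(iters):
        # Primal step: for softmax policies the natural gradient update
        # becomes a multiplicative-weights update with the Lagrangian Q-values.
        Q_L = evaluate(P, R + lam * U, pi, gamma)
        pi = pi * np.exp(eta * Q_L / (1.0 - gamma))
        pi /= pi.sum(axis=1, keepdims=True)
        # Dual step: projected sub-gradient descent on the constraint violation.
        Q_u = evaluate(P, U, pi, gamma)
        V_u = rho @ np.sum(pi * Q_u, axis=1)       # expected total utility from rho
        lam = np.clip(lam - eta * (V_u - b), 0.0, lam_max)
    return pi, lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 3
    P = rng.dirichlet(np.ones(S), size=(S, A))     # random transition kernel
    R, U = rng.random((S, A)), rng.random((S, A))  # random reward and utility
    pi, lam = npg_pd(P, R, U, b=0.5 / (1 - 0.9), rho=np.full(S, 1 / S))
    print(pi.round(3), lam)
```

Under softmax parameterization the natural gradient step reduces to this closed-form multiplicative update on the policy itself, which is why the sketch never materializes explicit parameter vectors; the dual variable is simply clipped back onto a bounded interval after each sub-gradient step.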

Takeaways, Limitations

Takeaways:
Effective solution to constrained MDPs: solves constrained sequential decision-making problems by leveraging natural policy gradient methods.
Global convergence guarantee: sublinear convergence of both the optimality gap and the constraint violation under softmax policy parameterization.
Dimension-free rates: convergence rates that are independent of the size of the state-action space.
Support for various policy parameterizations: convergence rates for log-linear and general smooth policy parameterizations.
Sample-based algorithm analysis: convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms.
Experimental validation: the effectiveness of the proposed method is demonstrated through computational experiments.
Limitations:
Policy parameterization dependence: the convergence rate depends on the specific policy parameterization (softmax, log-linear, general smooth).
Function approximation error: for log-linear and general smooth parameterizations, the guarantees include a function approximation error term caused by the restricted policy class.