Daily Arxiv

This page curates artificial-intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Monotone and Conservative Policy Iteration Beyond the Tabular Case

Created by
  • Haebom

Author

SR Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal

Outline

This paper introduces Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative Policy Iteration (CPI) that retain tabular-style guarantees under function approximation. RPI performs policy evaluation via a novel Bellman-constrained optimization, restoring textbook-like monotonicity of the value estimates and guaranteeing that they lower-bound the true return. CRPI shares RPI's evaluation step but updates the policy conservatively, by maximizing a new lower bound on the performance difference that explicitly accounts for function-approximation error; it inherits RPI's guarantees and admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. The work addresses a fundamental gap: widely used algorithms such as TRPO and PPO are derived from tabular CPI yet lose CPI's guarantees under function approximation, which can lead to divergence, oscillation, or convergence to suboptimal policies.
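The abstract does not spell out RPI's evaluation step, only the guarantee it restores: a value estimate that is monotone and a certified lower bound on the true return. The sketch below is a hypothetical illustration in that spirit, not the paper's method: the value estimate is restricted to a linear feature class and maximized under the componentwise Bellman constraint V ≤ r^π + γ P^π V, which in a finite MDP certifies V ≤ V^π. The function name, the state weighting mu, and the toy MDP are assumptions made for illustration.

```python
# Hypothetical sketch (not the paper's exact formulation): policy evaluation as a
# Bellman-constrained linear program. Any V with V <= r_pi + gamma * P_pi @ V is a
# pointwise lower bound on the true V^pi, so maximizing a weighted sum of V over a
# linear function class yields a certified lower bound on the return.
import numpy as np
from scipy.optimize import linprog

def lower_bound_evaluation(P_pi, r_pi, Phi, mu, gamma=0.9):
    """P_pi: (S,S) state-transition matrix under the policy,
    r_pi: (S,) expected one-step reward under the policy,
    Phi:  (S,d) feature matrix (value estimate is Phi @ w),
    mu:   (S,) nonnegative state weights for the objective."""
    # Constraint (Phi - gamma * P_pi @ Phi) @ w <= r_pi
    # is equivalent to Phi @ w <= r_pi + gamma * P_pi @ (Phi @ w).
    A_ub = Phi - gamma * P_pi @ Phi
    # Maximize mu^T Phi w, i.e. minimize -(Phi^T mu)^T w.
    res = linprog(c=-Phi.T @ mu, A_ub=A_ub, b_ub=r_pi,
                  bounds=[(None, None)] * Phi.shape[1])
    assert res.success, "LP must be feasible and bounded for this toy example"
    return Phi @ res.x  # pointwise lower bound on V^pi

# Tiny random MDP; a constant feature column keeps the LP feasible.
rng = np.random.default_rng(0)
S, d = 5, 3
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(S)                                   # rewards in [0, 1]
Phi = np.hstack([np.ones((S, 1)), rng.random((S, d - 1))])
V_lb = lower_bound_evaluation(P_pi, r_pi, Phi, mu=np.ones(S) / S)
V_true = np.linalg.solve(np.eye(S) - 0.9 * P_pi, r_pi)
assert np.all(V_lb <= V_true + 1e-6)                   # certified lower bound holds
```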

Takeaways, Limitations

  • RPI restores monotonicity of the value estimates and guarantees a lower bound on the true return.
  • CRPI provides per-step improvement bounds; a conservative, mixture-style update in this spirit is sketched after this list.
  • RPI and CRPI restore PI/CPI-style guarantees for arbitrary function classes.
  • In initial simulations, both outperform PI and its variants.
  • The study speaks directly to the failure modes of widely used algorithms such as TRPO and PPO.
  • The paper's Limitations are not specifically presented (the abstract does not contain such information).
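The abstract states that CRPI picks its update by maximizing a new lower bound on the performance difference that accounts for function-approximation error, but it does not give the bound itself. For reference only, the sketch below shows the conservative mixture-style update shared by the CPI family; the step size alpha here is a hypothetical placeholder, whereas CRPI would instead set it by maximizing its own bound.

```python
# Hypothetical sketch of a conservative, mixture-style policy update (CPI family).
# alpha is a placeholder; it is NOT the step size prescribed by CRPI's bound.
import numpy as np

def conservative_update(pi_old, pi_candidate, alpha):
    """pi_old, pi_candidate: (S, A) stochastic policies; 0 <= alpha <= 1.
    Returns the mixture (1 - alpha) * pi_old + alpha * pi_candidate, which moves
    toward the candidate policy without abandoning the current one."""
    return (1.0 - alpha) * pi_old + alpha * pi_candidate

# Example: a small step toward a greedy candidate over 3 states and 2 actions.
pi_old = np.array([[0.5, 0.5], [0.7, 0.3], [0.2, 0.8]])
pi_greedy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pi_new = conservative_update(pi_old, pi_greedy, alpha=0.1)
assert np.allclose(pi_new.sum(axis=1), 1.0)  # still a valid stochastic policy
```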