Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

Created by
  • Haebom

Authors

Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock

Outline

Inverse reinforcement learning (IRL) learns a reward function that explains expert demonstrations. Modern IRL methods often rely on an adversarial (min-max) formulation that alternates between reward and policy optimization, which frequently leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning the reward and policy through energy-based formulations, but they lack formal guarantees. This study addresses this gap. First, we present a unified view: standard non-adversarial methods explicitly or implicitly maximize the likelihood of expert actions, which is equivalent to minimizing the expected return difference. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement of this likelihood via a Minorization-Maximization procedure. We implement TRRO as Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm (a simplified sketch of the core idea follows below). Theoretically, TRRO provides an IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or outperforms state-of-the-art baselines in reward recovery and policy imitation, with high sample efficiency, on MuJoCo and Gym-Robotics benchmarks as well as on a real-world animal behavior modeling task.
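The PIRO update itself is not spelled out in this summary, so the following is only a minimal sketch of the underlying idea in a simplified contextual-bandit setting: the learned reward directly induces an energy-based softmax policy, the reward is updated by ascending the expert-action log-likelihood, and a squared proximal penalty (a crude stand-in for the trust-region constraint) keeps each update close to the previous reward. All names, hyperparameters, and the penalty form are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 4

# Ground-truth reward, used only to synthesize "expert" demonstrations.
true_r = rng.normal(size=(n_states, n_actions))

def softmax_policy(r):
    """Energy-based policy: pi(a|s) proportional to exp(r(s, a))."""
    z = r - r.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Expert (state, action) pairs drawn from the ground-truth policy.
expert_pi = softmax_policy(true_r)
states = rng.integers(n_states, size=2000)
actions = np.array([rng.choice(n_actions, p=expert_pi[s]) for s in states])

r = np.zeros((n_states, n_actions))  # learned reward table
eta = 0.5                            # proximal step size: smaller = tighter "trust region"

for _ in range(50):                  # outer iterations
    r_old = r.copy()
    for _ in range(10):              # inner ascent on the proximally regularized likelihood
        pi = softmax_policy(r)
        grad = np.zeros_like(r)
        # Gradient of the mean expert log-likelihood wrt r:
        # indicator(a = a_i) - pi(a | s_i), accumulated over demonstrations.
        np.add.at(grad, (states, actions), 1.0)
        for s in range(n_states):
            grad[s] -= (states == s).sum() * pi[s]
        grad /= len(states)
        grad -= (r - r_old) / eta    # proximal penalty keeps the new reward near r_old
        r += 0.1 * grad

ll = np.log(softmax_policy(r)[states, actions]).mean()
print(f"mean expert log-likelihood under the learned reward: {ll:.3f}")
```

In the full method, the policy would presumably be obtained by solving (soft) RL under the current reward rather than read off a table, and the paper's surrogate is constructed so that each maximization step provably does not decrease the expert-action likelihood; this toy version only illustrates the "likelihood ascent plus proximity to the previous reward" structure.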

Takeaways, Limitations

Takeaways:
Presents a unified view of non-adversarial IRL methods: they explicitly or implicitly maximize the likelihood of expert actions.
Proposes the Trust Region Reward Optimization (TRRO) framework, which guarantees monotonic improvement of this likelihood via a Minorization-Maximization procedure (the generic MM argument is sketched after this list).
Develops the Proximal Inverse Reward Optimization (PIRO) algorithm as a practical implementation of TRRO.
Theoretical guarantee: TRRO provides an IRL counterpart to TRPO's stability guarantees in forward RL.
Empirical results: PIRO matches or outperforms existing methods on MuJoCo and Gym-Robotics benchmarks and in modeling real animal behavior.
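For context, the monotonic-improvement claim follows the standard Minorization-Maximization argument; in generic form (this is the textbook MM inequality, not the paper's specific surrogate), with L the expert-action log-likelihood as a function of the reward r and g a surrogate that minorizes it:

```latex
% Generic Minorization-Maximization argument (not the paper's specific surrogate).
% g(\cdot \mid r_k) minorizes L at the current iterate r_k:
g(r \mid r_k) \le L(r) \quad \forall r, \qquad g(r_k \mid r_k) = L(r_k)
% Maximizing the surrogate,
r_{k+1} = \arg\max_{r}\, g(r \mid r_k)
% yields monotonic improvement of the likelihood:
\Rightarrow\quad L(r_{k+1}) \;\ge\; g(r_{k+1} \mid r_k) \;\ge\; g(r_k \mid r_k) \;=\; L(r_k)
```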
Limitations:
No limitations are explicitly specified in the paper.