This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
Created by
Haebom
Author
Yuxin Chen, Chen Tang, Jianglan Wei, Chenran Li, Ran Tian, Xiang Zhang, Wei Zhan, Peter Stone, Masayoshi Tomizuka
Outline
This paper addresses the problem of aligning robot behavior with human preferences, a prerequisite for deploying AI agents in human-centered environments. Interactive imitation learning, in which a human expert observes policy execution and intervenes to provide corrective feedback, is a promising solution, but existing methods fail to utilize the prior policy efficiently to facilitate learning. This paper proposes Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ) for sample-efficient alignment from human intervention. Instead of inferring the complete set of human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's underlying reward function and that of the prior policy. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
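Conceptually, the method alternates between two steps: a max-entropy IRL update that fits the residual reward to human intervention samples, and a Residual Q-Learning update that realigns the policy to the combined (prior + residual) reward without ever observing the prior reward itself. The following is a minimal tabular sketch of these two steps, under assumptions not taken from the paper's code: a discrete state-action space, a linear residual reward r_θ(s,a) = θ·φ(s,a), and access to the prior policy's soft Q-function. All names (`soft_value`, `rql_update`, `maxent_irl_step`, `phi`) are illustrative.

```python
# Illustrative tabular sketch of the two MEReQ steps (not the authors' implementation).
import numpy as np

def soft_value(Q, alpha=1.0):
    # Soft (maximum-entropy) state value: V(s) = alpha * log sum_a exp(Q(s, a) / alpha),
    # computed with the usual max-shift for numerical stability.
    m = Q.max(axis=1, keepdims=True)
    return (m + alpha * np.log(np.exp((Q - m) / alpha).sum(axis=1, keepdims=True))).squeeze(1)

def rql_update(Q_prior, Q_res, r_res, transitions, gamma=0.99, alpha=1.0, lr=0.1):
    """One Residual Q-Learning sweep: learn Q_res so that Q_prior + Q_res is
    soft-optimal for (r_prior + r_res), using only r_res and Q_prior.
    Subtracting the prior's soft Bellman equation from the combined one gives:
      Q_res(s,a) = r_res(s,a) + gamma * [V_total(s') - V_prior(s')].
    """
    V_total = soft_value(Q_prior + Q_res, alpha)
    V_prior = soft_value(Q_prior, alpha)
    for s, a, s_next in transitions:          # transitions: iterable of (s, a, s') indices
        target = r_res[s, a] + gamma * (V_total[s_next] - V_prior[s_next])
        Q_res[s, a] += lr * (target - Q_res[s, a])
    return Q_res

def maxent_irl_step(theta, phi, expert_sa, policy_sa, lr=0.05):
    """Max-ent IRL gradient step on a linear residual reward r_theta = theta . phi:
    move theta toward the feature expectations of human intervention samples
    and away from those of the current policy's own rollouts."""
    mu_expert = np.mean([phi[s, a] for s, a in expert_sa], axis=0)
    mu_policy = np.mean([phi[s, a] for s, a in policy_sa], axis=0)
    return theta + lr * (mu_expert - mu_policy)

# Sketch of the alternating loop (pseudo-usage):
#   1. Roll out the current policy; the human expert intervenes on unsatisfactory behavior.
#   2. theta = maxent_irl_step(theta, phi, expert_sa, policy_sa)  # update residual reward
#   3. r_res = phi @ theta                                        # (S, A, d) @ (d,) -> (S, A)
#   4. Q_res = rql_update(Q_prior, Q_res, r_res, transitions)     # realign the policy
```

Because the residual reward only has to explain the gap between the expert and the prior policy, it is typically a simpler function than the full human reward, which is where the sample-efficiency claim comes from.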
Takeaways, Limitations
•
Takeaways:
◦
Presents MEReQ, a novel method for sample-efficient policy alignment from human intervention.
◦
Improves learning efficiency by effectively utilizing the prior policy.
◦
Effectiveness is verified on both simulated and real-world tasks.
•
Limitations:
◦
Further research is needed to determine the generality of the proposed method and its applicability to various environments.
◦
The frequency and quality of human expert interventions need further assessment.
◦
Robustness to complex tasks and diverse types of human feedback remains to be assessed.