Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Automatic Reward Shaping from Confounded Offline Data

Created by
  • Haebom

Authors

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Outline

This paper addresses a core challenge in artificial intelligence: learning effective policies to control agents in unknown environments and optimize performance metrics. Off-policy learning methods, such as Q-learning, allow a learner to make optimal decisions based on past experience. The paper studies off-policy learning from biased data in complex, high-dimensional domains where unobserved confounding cannot be ruled out in advance. Building on the well-known Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm that is robust to confounding biases in the observed data. Specifically, the algorithm seeks a safe policy for the worst-case environment that is compatible with the observations. We apply the proposed method to twelve perturbed Atari games and show that it consistently outperforms standard DQN in all games where the observed inputs to the behavior and target policies mismatch and unobserved confounders are present.
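To make the worst-case idea more concrete, below is a minimal, hypothetical sketch of how a pessimistic bootstrap target could be plugged into a DQN-style loss. The function name pessimistic_dqn_loss and the margin argument (standing in for a bound on how far the true value may lie below the point estimate under confounding) are illustrative assumptions; the paper's actual construction derives such worst-case quantities causally from the confounded offline data and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def pessimistic_dqn_loss(q_net, target_net, batch, gamma=0.99, margin=None):
    """Illustrative sketch (not the paper's algorithm): a DQN loss whose
    bootstrap target uses a worst-case (lowered) next-state value.

    `margin` is a hypothetical placeholder for how much the target network's
    point estimate should be discounted to stay safe under confounding.
    """
    obs, actions, rewards, next_obs, dones = batch

    # Q(s, a) for the actions actually taken in the offline data.
    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Standard DQN would bootstrap from max_a Q_target(s', a).
        q_next = target_net(next_obs).max(dim=1).values
        # Pessimism: shrink the bootstrap value toward the worst case
        # still considered compatible with the (possibly confounded) data.
        if margin is not None:
            q_next = q_next - margin
        target = rewards + gamma * (1.0 - dones) * q_next

    return F.smooth_l1_loss(q_pred, target)
```

With margin=None this reduces to the ordinary DQN target, which highlights the design difference: the robust variant deliberately bootstraps from a lower, worst-case-compatible value rather than the biased point estimate.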

Takeaways, Limitations

Takeaways: We present a novel algorithm that improves off-policy reinforcement learning in complex environments with unobserved confounders. The proposed algorithm outperforms standard DQN on perturbed Atari games, and seeking a safe policy for the worst-case compatible environment proves effective for increasing robustness against confounding bias.
Limitations: The performance evaluation is limited to Atari games, so generalizability to other types of environments or problems requires further research. Assuming worst-case scenarios without explicitly modeling the unobserved confounders may yield overly conservative policies. Applicability and efficiency in real-world settings also need further verification.