Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?

Created by
  • Haebom

Authors

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Outline

This paper presents Policy-Aware Matrix Completion (PAMC), a novel structural reward-learning framework for tackling sparse-reward reinforcement learning (RL). PAMC exploits the approximately low-rank plus sparse structure of the reward matrix under policy-biased sampling. It uses inverse-propensity weights to prove a recovery guarantee and establishes a visit-weighted error-regret bound that links completion error to control performance. When the structural assumption weakens, PAMC widens its confidence intervals and safely falls back to exploration, abstaining from the completed rewards. Experimentally, PAMC improves sample efficiency on benchmarks including Atari-26, DM Control, MetaWorld MT50, and D4RL offline RL, and it outperforms DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO in compute-normalized comparisons. These results position PAMC as a practical and principled tool when rewards have exploitable structure, and as a first concrete instance of the broader structural reward-learning perspective.
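To make the mechanism concrete, here is a minimal sketch of inverse-propensity-weighted completion of a partially observed reward matrix with a low-rank plus sparse decomposition. This is an illustrative assumption, not the authors' implementation: the function name pamc_sketch, the plain gradient and soft-threshold updates, and all hyperparameters are made up for the example.

    import numpy as np

    def pamc_sketch(R_obs, mask, propensity, rank=5, lam_sparse=0.1,
                    lr=0.01, n_iters=2000, seed=0):
        """Illustrative sketch (not the paper's code): complete a reward matrix
        as low-rank (U @ V.T) plus sparse (Sp), with inverse-propensity weights
        correcting for policy-biased sampling of the observed entries.

        R_obs      : (S, A) observed rewards, zeros where unobserved
        mask       : (S, A) binary matrix, 1 where an entry was observed
        propensity : (S, A) estimated probability that the behavior policy
                     visits each (state, action) pair
        """
        rng = np.random.default_rng(seed)
        S, A = R_obs.shape
        U = 0.1 * rng.standard_normal((S, rank))   # low-rank factor (states)
        V = 0.1 * rng.standard_normal((A, rank))   # low-rank factor (actions)
        Sp = np.zeros((S, A))                      # sparse component

        # Inverse-propensity weights: rarely visited entries get larger weight,
        # so the weighted loss approximates the loss under uniform sampling.
        w = mask / np.clip(propensity, 1e-3, None)

        for _ in range(n_iters):
            resid = w * (U @ V.T + Sp - R_obs)     # weighted residual on observed entries
            gU, gV = resid @ V, resid.T @ U        # gradients of the weighted squared loss
            U -= lr * gU
            V -= lr * gV
            # Proximal (soft-threshold) step keeps the sparse component sparse.
            Sp -= lr * resid
            Sp = np.sign(Sp) * np.maximum(np.abs(Sp) - lr * lam_sparse, 0.0)

        return U @ V.T + Sp                        # completed reward matrix

A full implementation would typically use a proximal or spectral solver instead of plain gradient steps and would estimate the propensities from the behavior policy's visitation counts; the point of the sketch is only to show how the inverse-propensity weights enter the completion objective.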

Takeaways and Limitations

Takeaways:
  • We show that the sample efficiency of sparse-reward reinforcement learning can be improved by exploiting the low-rank plus sparse structure of the reward matrix, even under policy-biased sampling.
  • We provide theoretical justification via inverse-propensity weighting and a visit-weighted error-regret bound.
  • We report experimental results that outperform existing methods across a range of benchmarks.
  • We introduce structural reward learning as a new perspective and provide a concrete methodology for it.
Limitations:
  • The method requires the reward matrix to have an approximately low-rank plus sparse structure, an assumption that is not always satisfied.
  • When the assumption weakens, the algorithm abstains and safely falls back to exploration, which may degrade performance (see the sketch after this list).
  • Experimental results are limited to specific benchmarks, and performance may differ in other environments.
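The fallback behavior mentioned above can be illustrated with a small gating rule. This is a hypothetical sketch, assuming a residual-based widening of the confidence interval and a count-based exploration bonus; the names, thresholds, and bonus form are not taken from the paper.

    import numpy as np

    def reward_or_explore(r_hat, ci_width, visit_count, fit_residual,
                          residual_tol=0.5, ci_tol=1.0, bonus_scale=1.0):
        """Hypothetical gating rule: trust the completed reward only when the
        structural fit and its confidence interval are tight; otherwise abstain
        and fall back to an exploration bonus."""
        # Widen the interval when the low-rank + sparse fit looks poor.
        widened_ci = ci_width * (1.0 + max(fit_residual - residual_tol, 0.0))
        if widened_ci <= ci_tol:
            return r_hat                                  # use the imputed reward
        return bonus_scale / np.sqrt(visit_count + 1.0)  # explore rarely visited pairs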