Daily Arxiv

This page curates AI-related papers from around the world.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

Created by
  • Haebom

Author

Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse

Outline

In reinforcement learning, specifying a reward function that captures the intended behavior can be very difficult. Reward learning attempts to address this by learning the reward function from data. However, a learned reward model may have low error on the data distribution and yet yield a policy with large regret; we say that such a reward model suffers from error-regret mismatch. The main cause of error-regret mismatch is the distribution shift that typically occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error there exist realistic data distributions under which error-regret mismatch can still occur. We then show that similar problems persist even when using the policy regularization techniques commonly employed in methods such as RLHF. We hope these results will stimulate theoretical and empirical research on improved methods for learning reward models and on better ways to reliably measure their quality.
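To make error-regret mismatch concrete, here is a minimal toy sketch (not taken from the paper): a one-state, two-action bandit where the learned reward model is accurate on the action the data distribution mostly covers but badly wrong on a rare action, which policy optimization then exploits. The reward values, data frequencies, use of mean absolute error, and greedy optimization are illustrative assumptions, not the paper's construction.

```python
# Toy illustration (assumed setup, not from the paper): low expected test error
# under the data distribution coexisting with large regret after optimization.

true_reward = {"a0": 1.0, "a1": 0.0}      # ground-truth reward per action
learned_reward = {"a0": 1.0, "a1": 2.0}   # wrong only on the rarely observed action
data_dist = {"a0": 0.99, "a1": 0.01}      # action frequencies in the training data

# Expected test error under the data distribution (mean absolute error).
test_error = sum(
    p * abs(learned_reward[a] - true_reward[a]) for a, p in data_dist.items()
)

# Greedy policy optimization against the learned reward picks its highest-scoring action,
# shifting all probability mass onto a region the data barely covered (distribution shift).
chosen = max(learned_reward, key=learned_reward.get)

# Regret: shortfall of the chosen action's true reward relative to the best true reward.
regret = max(true_reward.values()) - true_reward[chosen]

print(f"expected test error on the data distribution: {test_error:.2f}")  # 0.02
print(f"action chosen by optimizing the learned reward: {chosen}")        # a1
print(f"regret under the true reward: {regret:.2f}")                      # 1.00
```

The test error is tiny because the data distribution rarely probes the action where the model is wrong, yet the optimized policy incurs the maximum possible regret.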

Takeaways, Limitations

Takeaways: We mathematically proved that a low expected test error of a reward model does not always guarantee low regret, i.e., that an error-regret mismatch problem exists. We also showed that policy regularization techniques cannot completely resolve this problem (see the sketch after this list). This suggests the need for research on improved methods for learning and evaluating reward models.
Limitations: The paper focuses on theoretical analysis and does not provide experimental validation on real datasets or algorithms. It also does not propose a concrete methodology for resolving the error-regret mismatch problem.
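For reference, the policy regularization discussed here is typically the KL penalty toward a reference policy used in RLHF-style fine-tuning. The formulation below is the standard objective with assumed notation, not the paper's own:

```latex
% Standard KL-regularized policy-optimization objective (RLHF-style);
% \hat{R} is the learned reward model, \pi_{\mathrm{ref}} a reference policy,
% and \beta > 0 the regularization strength (notation assumed, not from the paper).
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ \hat{R}(x, y) \big]
    - \beta \, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```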