
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Rethinking the Foundations for Continual Reinforcement Learning

Created by
  • Haebom

Author

Esraa Elelimy, David Szepesvari, Martha White, Michael Bowling

Outline

This paper analyzes the differences between the traditional view of reinforcement learning (RL) and continual reinforcement learning (CRL), and proposes a new formalism better suited to CRL. While traditional RL stops learning once an optimal policy is found, CRL aims for continual learning and adaptation. The authors argue that four pillars of traditional RL, namely Markov Decision Processes (MDPs), the focus on time-independent artifacts, the expected sum of rewards as the evaluation metric, and the episodic benchmark environments built around these pillars, are in conflict with the goals of CRL. They propose a new formalism that replaces the first and third pillars with history processes and a new deviation-regret evaluation metric suited to continual learning, and discuss possible approaches to improving the other two pillars.
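As a rough illustration of the continual setting (not the paper's formalism), the sketch below runs a never-ending interaction with a non-stationary toy environment and tracks a regret-style gap instead of a single expected return. The environment, the epsilon-greedy learner, and the per-step regret proxy are all assumptions made for illustration; the paper's history-process formalism and its deviation-regret definition are more general than this stand-in.

```python
# Illustrative sketch only: a continual (non-episodic) interaction loop with a
# regret-style running metric. Not the paper's deviation-regret definition.
import random

def step(history, action):
    """Toy non-stationary environment: the rewarding action flips every 500
    steps, so no single fixed policy stays optimal (the continual setting)."""
    t = len(history)
    good_action = 0 if (t // 500) % 2 == 0 else 1
    return 1.0 if action == good_action else 0.0

def continual_run(T=2000, epsilon=0.1):
    history = []                       # the growing interaction history
    values, counts = [0.0, 0.0], [0, 0]
    total_reward, best_so_far = 0.0, 0.0
    for _ in range(T):
        # An epsilon-greedy learner that never stops adapting.
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = 0 if values[0] >= values[1] else 1
        best = max(step(history, a) for a in (0, 1))   # best per-step choice
        reward = step(history, action)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        history.append((action, reward))
        total_reward += reward
        best_so_far += best
    # Regret-style gap: reward left on the table relative to the per-step best
    # action. A simplified stand-in metric, not the paper's deviation regret.
    return best_so_far - total_reward

if __name__ == "__main__":
    print("regret-style gap over one continual run:", continual_run())
```

Because the rewarding action keeps changing, an agent that converged and stopped learning would accumulate an ever-growing gap; this is one way to see the intuition behind evaluating CRL agents by a regret over histories rather than by the expected return of a fixed optimal policy.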

Takeaways, Limitations

Takeaways:
By clearly showing that the traditional foundations of RL are unsuitable for CRL, the paper suggests a new direction for CRL research.
It makes an important contribution by proposing a novel formalism (history processes and deviation regret) suited to CRL.
It moves beyond the limitations of existing RL and points toward new research directions for the development of CRL.
Limitations:
Further studies are needed to investigate the practical applicability and efficiency of the proposed formalism.
The paper offers few concrete methodological suggestions for improving the remaining two pillars (the focus on time-independent artifacts and episodic benchmark environments).
There is little discussion of the computational complexity of the proposed deviation regret or of the difficulty of applying it in practice.