Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Exploration Behavior of Untrained Policies

Created by
  • Haebom

Author

Jacob Adamczyk

Outline

This paper addresses the challenge of exploration in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. The author studies how the architecture of a deep neural network policy implicitly shapes exploration before any training takes place. Using a simple toy model, the paper shows theoretically and experimentally how untrained policies can generate ballistic or diffusive trajectories. Drawing on infinite-width network theory and continuous-time limits, it shows that untrained policies produce correlated actions and induce meaningful state-visitation distributions. The paper characterizes the resulting trajectory distributions for standard architectures, offering insight into the inductive biases that affect exploration early in training. Overall, it establishes a theoretical and experimental framework that treats policy initialization as a design tool for understanding exploration behavior.
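Below is a minimal sketch of the kind of experiment the outline describes: rolling out an untrained tanh MLP policy in a toy 2D point-mass environment and comparing its mean squared displacement (MSD) against i.i.d. random actions, since roughly quadratic growth in time indicates ballistic motion while roughly linear growth indicates diffusive motion. The environment, network sizes, noise scale, and PyTorch implementation here are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the paper's code): measure whether untrained-policy
# trajectories spread ballistically (MSD ~ t^2) or diffusively (MSD ~ t).
import torch
import torch.nn as nn

def make_policy(hidden=64):
    """Small tanh MLP mapping a 2D position to a 2D action mean; weights stay untrained."""
    return nn.Sequential(
        nn.Linear(2, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, 2),
    )

@torch.no_grad()
def rollout(action_fn, steps=400, dt=0.05):
    """Integrate the point mass x_{t+1} = x_t + dt * a_t and return the trajectory."""
    x = torch.zeros(2)
    traj = [x.clone()]
    for _ in range(steps):
        x = x + dt * action_fn(x)
        traj.append(x.clone())
    return torch.stack(traj)

def msd(trajs):
    """Mean squared displacement from the starting state, averaged over rollouts."""
    disp = trajs - trajs[:, :1]
    return (disp ** 2).sum(-1).mean(0)

if __name__ == "__main__":
    torch.manual_seed(0)
    policy_rollouts, random_rollouts = [], []
    for _ in range(64):
        policy = make_policy()  # fresh untrained initialization for each rollout
        policy_rollouts.append(rollout(lambda s: policy(s) + 0.1 * torch.randn(2)))
        random_rollouts.append(rollout(lambda s: torch.randn(2)))  # i.i.d. actions
    m_pol = msd(torch.stack(policy_rollouts))
    m_rnd = msd(torch.stack(random_rollouts))
    # Growth exponent over one doubling of time: ~2 is ballistic, ~1 is diffusive.
    t0, t1 = 200, 400
    exp_pol = torch.log2(m_pol[t1] / m_pol[t0]).item()
    exp_rnd = torch.log2(m_rnd[t1] / m_rnd[t0]).item()
    print(f"untrained-policy MSD exponent ~ {exp_pol:.2f}, i.i.d.-action exponent ~ {exp_rnd:.2f}")
```

In this toy setting the untrained policy acts as a smooth, correlated action field, so its displacement exponent tends toward the ballistic regime, while the i.i.d. baseline stays diffusive; this is only meant to make the outline's ballistic-versus-diffusive distinction concrete.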

Takeaways, Limitations

Takeaways:
• Theoretically and experimentally elucidates the mechanisms by which the architecture of an untrained policy shapes its initial exploration behavior.
• Presents a novel framework for leveraging policy initialization to design exploration strategies.
• Provides deeper insight through analysis based on infinite-width network theory and continuous-time limits.
Limitations:
• Experiments use a simple toy model; further research is needed on generalization to complex, realistic RL environments.
• The analysis is limited to specific architectures; a wider range of architectures remains to be studied.
• The analysis focuses on the early stage of training, so the impact on long-term exploration performance requires further study.