Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

An Analysis of Action-Value Temporal-Difference Methods That Learn State Values

Created by
  • Haebom

Authors

Brett Daley, Prabhat Nagarajan, Martha White, Marlos C. Machado

Outline

This paper examines bootstrapping in temporal-difference (TD) learning, i.e., forming new value predictions from previous ones. Most TD control methods, such as Q-learning and Sarsa, bootstrap from a single action-value function. By contrast, methods that maintain two asymmetric value functions and learn action values using state values as an intermediate step (e.g., QV-learning or AV-learning) have received relatively little attention. The paper analyzes these two algorithm families in terms of convergence and sample efficiency, showing that while both are more efficient than Expected Sarsa in the prediction setting, only AV-learning offers a meaningful advantage over Q-learning in the control setting. Finally, the authors propose Regularized Dueling Q-learning (RDQ), a new AV-learning algorithm that significantly outperforms Dueling DQN on the MinAtar benchmark.
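For concreteness, below is a minimal tabular sketch of the QV-learning update, following the commonly cited rule in which the action-value update bootstraps from the state-value estimate rather than from Q itself. The environment interface (Gymnasium-style step/reset), step sizes, and epsilon-greedy policy are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def qv_learning_episode(env, Q, V, alpha=0.1, beta=0.1, gamma=0.99, epsilon=0.1):
    """One episode of tabular QV-learning (sketch).

    Q: np.ndarray of shape (n_states, n_actions) -- action values
    V: np.ndarray of shape (n_states,)           -- state values
    Key asymmetry: Q bootstraps from V, while V is learned by ordinary TD(0).
    `env` is assumed to follow the Gymnasium reset/step API.
    """
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection (illustrative choice)
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(np.argmax(Q[s]))

        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        target = r + (0.0 if terminated else gamma * V[s_next])

        # TD(0) update for the state value
        V[s] += beta * (target - V[s])
        # Action-value update bootstraps from V, not from max_a Q
        Q[s, a] += alpha * (target - Q[s, a])

        s = s_next
    return Q, V
```

Because the Q update regresses toward r + γV(s') instead of r + γ max_a Q(s', a), the state value acts as the intermediate quantity the paper refers to; whether this yields faster learning than Q-learning is exactly what the control-setting analysis examines.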

Takeaways, Limitations

Takeaways:
We show that AV-learning methods, which use two asymmetric value functions instead of a single action-value function, can be more efficient than Q-learning in the control setting.
We experimentally demonstrate that RDQ, a new AV-learning algorithm, outperforms the existing Dueling DQN (a sketch of the dueling decomposition it builds on appears after this list).
In the prediction setting, both QV-learning and AV-learning are shown to be more efficient than Expected Sarsa.
Limitations:
The analysis of the relative strengths and weaknesses of QV-learning and AV-learning may be incomplete; the methods may be effective only in certain environments or problem classes.
RDQ's performance improvements may be limited to the MinAtar benchmark and may not generalize to other environments.
The analysis presented in this paper is limited to specific algorithms and benchmarks, and therefore requires more extensive experimental validation.
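For reference, Dueling DQN (the baseline RDQ is compared against) reconstructs action values from separate state-value and advantage streams. The sketch below shows that standard decomposition; the layer sizes and the mean-subtraction identifiability trick are ordinary Dueling DQN choices, not details of RDQ, whose specific regularization is described in the paper.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Standard dueling decomposition: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    Sizes are illustrative; RDQ's additional regularization is not shown here.
    """
    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)        # (batch, 1) state-value stream
        a = self.advantage(features)    # (batch, n_actions) advantage stream
        # Subtracting the mean advantage pins down the V/A split (standard trick)
        return v + a - a.mean(dim=1, keepdim=True)
```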