Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EXPO: Stable Reinforcement Learning with Expressive Policies

Created by
  • Haebom

Authors

Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn

Outline

This paper studies the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) using offline datasets. Expressive policy classes (such as diffusion and flow-matching policies) are parameterized by long denoising chains, which hinders stable gradient propagation from actions to policy parameters when optimizing against a value function. Instead of directly optimizing value through the expressive policy, the paper addresses stable value maximization by constructing an on-the-fly RL policy that maximizes Q-values. To this end, it proposes Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that uses two parameterized policies: a large, expressive base policy trained with a stable imitation-learning objective, and a lightweight Gaussian modification policy that edits actions sampled from the base policy toward a higher-value distribution. EXPO refines the base policy's actions with the learned modification policy and, for both environment sampling and temporal-difference (TD) backups, selects the action with the highest value among the base and modified actions. The proposed method achieves on average 2-3x better sample efficiency than existing methods, both when fine-tuning a pre-trained policy with offline data and when leveraging offline data for online training.
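To make the action-selection idea concrete, here is a minimal Python sketch (not the authors' code) of how choosing the highest-Q action among base-policy samples and their Gaussian edits might look. The module names (`base_policy`, `edit_policy`, `q_net`), shapes, and the candidate count are illustrative assumptions, not the paper's implementation.

```python
# Sketch of EXPO-style action selection under assumed interfaces:
#   base_policy(obs)        -> sampled actions from the expressive base policy
#   edit_policy(obs, act)   -> (mean, std) of a Gaussian action modification
#   q_net(obs, act)         -> estimated Q-values
import torch

def select_action(obs, base_policy, edit_policy, q_net, num_candidates=8):
    """Pick the highest-Q action among base-policy samples and their Gaussian edits."""
    obs_batch = obs.unsqueeze(0).expand(num_candidates, -1)        # [N, obs_dim]
    base_actions = base_policy(obs_batch)                          # [N, act_dim]

    # The Gaussian modification policy proposes corrections to the base actions,
    # nudging them toward a higher-value distribution.
    delta_mean, delta_std = edit_policy(obs_batch, base_actions)
    edited_actions = base_actions + delta_mean + delta_std * torch.randn_like(delta_mean)

    # Evaluate all candidates (base and edited) with the learned Q-function and
    # keep the argmax; the same selection would be used for sampling and TD backups.
    candidates = torch.cat([base_actions, edited_actions], dim=0)  # [2N, act_dim]
    obs_all = torch.cat([obs_batch, obs_batch], dim=0)
    q_values = q_net(obs_all, candidates).squeeze(-1)              # [2N]
    return candidates[q_values.argmax()]
```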

Takeaways, Limitations

Takeaways:
EXPO offers a novel methodology for stable online reinforcement learning with expressive policies.
Effective use of offline data significantly improves sample efficiency (2-3 times compared to existing methods).
Applicable to both fine-tuning of pre-trained policies and online learning based on offline data.
Limitations:
EXPO's performance may depend on the particular tasks or datasets on which it was evaluated.
Further research is needed on the design of the Gaussian modification policy.
Evaluation in higher-dimensional state spaces and more complex environments is still needed.