Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps

Created by
  • Haebom

Authors

Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu

Outline

In this paper, we propose the Noise-Conditioned Deterministic Policy Optimization (NCDPO) framework to address the limitations of diffusion policies. Diffusion policies can learn diverse skills thanks to their strong expressive power, but scarce or inadequate demonstration data can lead them to generate suboptimal trajectories or make serious errors. Existing reinforcement-learning-based fine-tuning methods struggle to apply PPO to diffusion models effectively because estimating action probabilities during the denoising process is computationally intractable. NCDPO treats each denoising step as a differentiable transformation conditioned on pre-sampled noise, which makes likelihood estimation tractable and allows gradients to be backpropagated through all diffusion timesteps. Experimental results show that NCDPO outperforms existing methods in both sample efficiency and final performance across various benchmarks, including continuous robot control and multi-agent game scenarios. In particular, it achieves sample efficiency comparable to MLP+PPO when training from randomly initialized policies, and it is robust to the number of diffusion timesteps.
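The mechanism is easier to see in code. Below is a minimal PyTorch sketch of the core idea as summarized above: pre-sample every noise draw so that the reverse-diffusion chain becomes a deterministic, differentiable map from state and noise to action, letting autograd reach all timesteps. The network architecture, the simplified DDPM-style update rule, and the placeholder objective (a stand-in for the paper's PPO surrogate) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Tiny epsilon-prediction network conditioned on state and timestep."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        t_embed = t.float().unsqueeze(-1) / 10.0  # crude timestep encoding
        return self.net(torch.cat([state, noisy_action, t_embed], dim=-1))

def denoise_with_presampled_noise(policy, state, noises, alphas):
    """Run the reverse diffusion chain with pre-sampled noise.

    Because every stochastic draw is fixed in `noises`, the chain is a
    deterministic, differentiable function of `state`, so autograd can
    backpropagate through all K denoising steps.
    """
    K = len(alphas)
    a = noises[K]  # pre-sampled terminal Gaussian sample a_K
    for k in reversed(range(K)):
        t = torch.full(state.shape[:1], k, device=state.device)
        eps_hat = policy(state, a, t)
        # Simplified DDPM-style posterior-mean step (illustrative only).
        a = (a - (1.0 - alphas[k]).sqrt() * eps_hat) / alphas[k].sqrt()
        if k > 0:
            a = a + 0.1 * noises[k - 1]  # inject the pre-sampled step noise
    return a

# Usage sketch: one gradient update through the full diffusion chain.
state_dim, action_dim, K = 4, 2, 5
policy = NoisePredictor(state_dim, action_dim)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)
noises = [torch.randn(32, action_dim) for _ in range(K + 1)]  # fixed draws
alphas = torch.linspace(0.9, 0.99, K)

actions = denoise_with_presampled_noise(policy, states, noises, alphas)
advantages = torch.randn(32, 1)  # placeholder advantage estimates
# Placeholder policy-gradient-style objective; the paper uses a PPO
# surrogate, which this stand-in does not reproduce.
loss = -(advantages * actions.sum(dim=-1, keepdim=True)).mean()
opt.zero_grad()
loss.backward()  # gradients flow through every denoising timestep
opt.step()
```

The key design choice is that fixing the noise up front plays a reparameterization-style role: the randomness is moved out of the denoising chain, so the chain itself can be differentiated end to end rather than requiring likelihood estimates at each stochastic step.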

Takeaways, Limitations

Takeaways:
NCDPO solves the sample efficiency problem of diffusion policies, enabling more effective policy learning when combined with reinforcement learning.
By outperforming existing methods on various benchmarks, it broadens applicability to domains such as real-world robot control and game AI.
Its robustness to the number of diffusion timesteps reduces the burden of hyperparameter tuning.
Limitations:
The experimental results presented in this paper are limited to specific benchmarks; generalization to other environments or tasks requires further study.
There is no quantitative analysis of how much NCDPO's computational cost increases relative to existing methods.
There is no analysis of the dependence on demonstration-data quality or of the potential performance degradation when demonstrations are scarce.