Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CaRL: Learning Scalable Planning Policies with Simple Rewards

Created by
  • Haebom

Author

Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, Andreas Geiger

Outline

This paper studies reinforcement learning (RL) for privileged planning in autonomous driving. Existing approaches are rule-based but do not scale well; RL, in contrast, scales with data and avoids the compounding-error problem of imitation learning. Prior RL approaches for driving use complex reward functions that sum many individual terms, such as progress, position, and orientation. This paper shows that PPO fails to optimize such rewards effectively as the mini-batch size grows, which limits scalability. The authors therefore propose a new reward design that optimizes a single intuitive term: route completion. Infractions are penalized either by terminating the episode or by multiplicatively reducing the route-completion reward. PPO trained with this simple reward scales well to larger mini-batch sizes and achieves improved performance, and large mini-batches in turn enable efficient scaling via distributed data parallelism. Training was scaled to 300 million samples in CARLA and 500 million samples in nuPlan on a single 8-GPU node. The resulting model reaches 64 DS on the CARLA longest6 v2 benchmark, significantly outperforming other RL methods that use more complex rewards. With minimal modifications to the CARLA recipe, it is also the best learning-based approach on nuPlan, scoring 91.3 on the Val14 benchmark with non-reactive traffic and 90.6 with reactive traffic, an order-of-magnitude improvement over previous research.
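To make the reward design concrete, the minimal Python sketch below computes a per-step reward from the increment in route completion, shrinking it multiplicatively when a soft infraction occurs and terminating the episode on a hard infraction. The function and variable names (compute_reward, soft_infractions, hard_infraction) and the exact infraction handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the "simple reward" idea described above.
# Names and exact infraction handling are assumptions for illustration.

def compute_reward(prev_route_completion: float,
                   route_completion: float,
                   soft_infractions: list[float],
                   hard_infraction: bool) -> tuple[float, bool]:
    """Return (reward, terminated) for a single environment step.

    prev_route_completion / route_completion: fraction of the route
        completed before and after the step (in [0, 1]).
    soft_infractions: multiplicative penalty factors in (0, 1],
        e.g. 0.5 for a minor violation on this step.
    hard_infraction: e.g. a collision; ends the episode.
    """
    if hard_infraction:
        # Hard violations terminate the episode, forfeiting all
        # future route-completion reward.
        return 0.0, True

    # Base reward: route progress made during this step.
    reward = max(route_completion - prev_route_completion, 0.0)

    # Soft violations shrink the reward multiplicatively.
    for factor in soft_infractions:
        reward *= factor

    return reward, False


# Example: 2% route progress with one soft infraction (factor 0.5).
r, done = compute_reward(0.40, 0.42, soft_infractions=[0.5], hard_infraction=False)
print(r, done)  # ≈ 0.01 False
```

Because the only positive signal is progress along the route, there are no competing shaped terms for PPO to trade off, which is the property the paper credits for stable optimization at large mini-batch sizes.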

Takeaways, Limitations

Takeaways:
We significantly improve the scalability of PPO by using a simple route-completion reward function.
We present a method for efficiently scaling training to hundreds of millions of samples.
We achieve state-of-the-art results among RL methods on CARLA and among learning-based planners on nuPlan.
We propose a reward function that is simpler and more effective than existing complex reward designs.
Limitations:
Further research is needed to determine whether the proposed method is applicable to all autonomous driving environments.
Simplification of the reward function may result in performance degradation in certain situations.
Since the experiments were conducted on a single 8-GPU node, performance in settings with fewer GPUs has not been verified.