Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Created by
  • Haebom

Authors

Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

Outline

This paper addresses the need for automated CUDA optimization, driven by the rapidly growing demand for GPU computing resources brought on by large language models. Whereas existing state-of-the-art models achieve low success rates when asked to improve CUDA kernel speed, the paper proposes CUDA-L1, an automated CUDA optimization framework based on contrastive reinforcement learning. Trained on an NVIDIA A100, CUDA-L1 achieves an average speedup of 17.7× and a maximum speedup of 449× across the 250 CUDA kernels in KernelBench. Although trained specifically for the A100, it also transfers well to other GPU architectures such as the H100, RTX 3090, L40, H800, and H20. CUDA-L1 discovers a variety of CUDA optimization techniques and combines them strategically to reach optimal performance; it also uncovers fundamental principles of CUDA optimization and rejects optimizations that hurt performance. The study demonstrates that reinforcement learning can turn an LLM with initially poor performance into an effective CUDA optimizer, suggesting the potential of automated CUDA computational optimization.
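The paper's discovered kernels are not reproduced in this summary. As a hedged illustration only, the sketch below shows the kind of transformation an automated optimizer like CUDA-L1 might propose (a naive per-element kernel versus a vectorized grid-stride variant) and how a KernelBench-style speedup signal, baseline time divided by candidate time, could be measured with CUDA events. The kernel names and the specific optimization are assumptions for illustration, not outputs of CUDA-L1.

```cuda
// Illustrative sketch, not taken from the paper: a naive elementwise kernel
// and an optimized candidate of the kind an automated optimizer might emit.
#include <cuda_runtime.h>
#include <cstdio>

// Naive baseline: one thread per element, scalar loads/stores.
__global__ void scale_add_naive(const float* x, const float* y, float* out,
                                float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * x[i] + y[i];
}

// Candidate optimization: grid-stride loop with vectorized float4 accesses,
// improving memory-bandwidth utilization and reducing launch overhead.
__global__ void scale_add_vec4(const float4* x, const float4* y, float4* out,
                               float alpha, int n4) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        float4 a = x[i], b = y[i];
        out[i] = make_float4(alpha * a.x + b.x, alpha * a.y + b.y,
                             alpha * a.z + b.z, alpha * a.w + b.w);
    }
}

int main() {
    const int n = 1 << 24;  // 16M elements; contents left uninitialized, only timing matters here
    float *x, *y, *out;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms_naive = 0.f, ms_vec4 = 0.f;

    // Time the naive kernel (in practice one would warm up and average
    // over repeated launches, as KernelBench-style evaluation does).
    cudaEventRecord(start);
    scale_add_naive<<<(n + 255) / 256, 256>>>(x, y, out, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_naive, start, stop);

    // Time the vectorized candidate (n/4 float4 elements, grid-stride loop).
    int n4 = n / 4;
    cudaEventRecord(start);
    scale_add_vec4<<<1024, 256>>>((const float4*)x, (const float4*)y,
                                  (float4*)out, 2.0f, n4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_vec4, start, stop);

    // Speedup as the reward signal: baseline time / candidate time.
    printf("naive: %.3f ms, vec4: %.3f ms, speedup: %.2fx\n",
           ms_naive, ms_vec4, ms_naive / ms_vec4);

    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

In a reinforcement-learning setup of this kind, the measured speedup would serve as the reward for the code-generating model, so candidates that merely compile but run no faster receive no credit.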

Takeaways, Limitations

Takeaways:
Demonstrates the utility of reinforcement learning for automated CUDA optimization.
Shows strong transferability across various GPU architectures.
Uncovers fundamental principles of CUDA optimization and suggests the possibility of discovering new optimization techniques.
Could contribute to higher GPU efficiency and help alleviate GPU computing resource shortages.
Limitations:
Evaluation is currently limited to the KernelBench dataset; generalization to other kinds of CUDA kernels and more complex applications remains to be shown.
The model may depend on the GPU architecture used for training (A100); broader testing and tuning on other architectures is needed.
Training and running CUDA-L1 is expected to require substantial computing resources; research on improving resource efficiency is needed.