
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Created by
  • Haebom

Author

Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

Outline

In this paper, we introduce CUDA-L1, a framework for automated CUDA optimization that addresses the rapidly growing demand for GPU computing resources driven by large-scale language models. CUDA-L1 is based on contrastive reinforcement learning and is trained on an NVIDIA A100; it achieves an average speedup of 17.7× across the 250 CUDA kernels in KernelBench, with a maximum speedup of 449×. Although it is trained specifically on the A100, it shows strong portability to other GPU architectures such as the H100, RTX 3090, L40, H800, and H20. CUDA-L1 discovers a variety of CUDA optimization techniques and combines them strategically to achieve the best performance, uncovers fundamental principles of CUDA optimization, and rejects optimizations that degrade performance. We demonstrate that reinforcement learning can turn an LLM with poor initial performance into an effective CUDA optimizer using only a speedup-based reward signal, without human expertise or domain knowledge.
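The training signal described above is essentially a measured speedup of a candidate kernel over a reference implementation. The snippet below is a minimal, hypothetical sketch of how such a correctness-gated, speedup-based reward could be computed with PyTorch CUDA events; it is not the authors' implementation, and the function names and structure are assumptions made for illustration.

```python
# Hypothetical sketch of a speedup-based reward for RL-driven kernel optimization.
# NOT the CUDA-L1 implementation; names and structure are illustrative assumptions.
import torch

def time_kernel(fn, args, warmup=10, iters=100):
    """Return the mean runtime (ms) of fn(*args), measured with CUDA events."""
    for _ in range(warmup):  # warm up caches, JIT, and GPU clocks
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

def speedup_reward(reference_fn, candidate_fn, args, atol=1e-4):
    """Reward = speedup of candidate over reference; 0 if the candidate is incorrect."""
    ref_out = reference_fn(*args)
    cand_out = candidate_fn(*args)
    if not torch.allclose(ref_out, cand_out, atol=atol):
        return 0.0  # reject kernels that produce wrong results
    t_ref = time_kernel(reference_fn, args)
    t_cand = time_kernel(candidate_fn, args)
    return t_ref / t_cand  # > 1.0 means the candidate is faster

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    reference = lambda t: torch.relu(t) * 2.0                  # unoptimized baseline
    candidate = torch.compile(lambda t: torch.relu(t) * 2.0)   # stand-in "optimized" variant
    print(f"reward (speedup): {speedup_reward(reference, candidate, (x,)):.2f}x")
```

In the paper this kind of reward is combined with contrastive reinforcement learning over multiple candidate kernels, but the key point conveyed here is that a correctness-gated speedup signal alone is enough to drive the optimization.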

Takeaways, Limitations

Takeaways:
Demonstrates a new possibility of transforming an LLM into an effective CUDA-optimization model via reinforcement learning.
Enhanced versatility by ensuring excellent portability across various GPU architectures.
Enables automated CUDA optimization without human expertise.
Can help improve GPU efficiency and alleviate shortages of GPU computing resources.
Can uncover fundamental principles of CUDA optimization and contribute to the discovery of new optimization techniques.
Limitations:
Results are reported only on a single benchmark (KernelBench), so generalization to other types of CUDA kernels requires further verification.
The model may depend on the GPU architecture (A100) used for training, and generalization performance on other architectures may need further improvement.
Further research is needed on performance and stability when applied to real-world applications.