
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized by Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

GPU Performance Portability needs Autotuning

Created by
  • Haebom

Author

Burkhard Ringlein, Thomas Parnell, Radu Stoica

Outline

This paper presents an approach to improving the performance and portability of large language model (LLM) inference. To address the poor portability caused by single-platform dependence, vendor lock-in, and the barriers to entry facing new AI hardware, the authors combine just-in-time (JIT) compilation with comprehensive auto-tuning of kernel parameters. Focusing on performance-critical LLM kernels, they show that this method explores up to 15x more kernel parameter configurations, generates significantly more diverse code across multiple dimensions, and achieves up to 230% better performance than vendor-optimized implementations, while shrinking kernel code size by 70x and eliminating manual code optimization. The results highlight auto-tuning as a promising path to model portability across GPU vendors.
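
The summary above does not name the underlying tooling, so the following is an illustration only: a minimal sketch of what "JIT compilation combined with kernel parameter auto-tuning" can look like using OpenAI Triton's @triton.autotune decorator. The kernel (add_kernel), its configuration grid, and the block sizes are hypothetical examples chosen for brevity, not the paper's actual kernels or search space.

```python
# Minimal sketch: a JIT-compiled Triton kernel whose launch parameters are
# auto-tuned. NOTE: add_kernel and its config grid are illustrative
# assumptions, not the kernels or search space from the paper.
import torch
import triton
import triton.language as tl

# Each Config is one point in the kernel parameter search space. At the first
# launch for a given `key` value, Triton benchmarks every config and caches
# the fastest one, so the same source code can tune itself on any GPU.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 128}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 512}, num_warps=8),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # BLOCK_SIZE is supplied by the autotuner, so it is not passed here.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n)
    return out
```

A real search space in this setting would also sweep dimensions such as num_stages and per-kernel tile shapes, which is presumably where a configuration count on the order of the paper's reported 15x comes from.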

Takeaways, Limitations

Takeaways:
JIT compilation combined with auto-tuning can improve both the portability and the performance of LLM inference.
Auto-tuned kernels can outperform vendor-optimized implementations.
Development efficiency improves through a much smaller kernel codebase and the elimination of manual optimization.
The approach suggests a new direction for ensuring model portability across GPU vendors.
Limitations:
Further study is needed on the generalizability of the method and its applicability to other LLM architectures and sizes.
The computational cost and time required by the auto-tuning process need analysis.
Performance and stability in real application environments require further evaluation.
The work focuses on optimizing specific performance-critical LLM kernels and does not consider performance improvements elsewhere.