In this paper, we present a novel approach to improving the performance and portability of large language model (LLM) inference. To address the poor portability that stems from single-platform dependence, vendor lock-in, and high entry barriers for new AI hardware, we propose a method that combines just-in-time (JIT) compilation with comprehensive kernel parameter auto-tuning. Focusing on performance-critical LLM kernels, we show that our method explores up to 15x more kernel parameter configurations, generates significantly more diverse code across multiple dimensions, and improves performance by up to 230% over vendor-optimized implementations, while reducing kernel code size by 70x and eliminating manual code optimization. Our results highlight auto-tuning as a promising approach to improving model portability across GPU vendors.
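
To make the combination of JIT compilation and kernel parameter auto-tuning concrete, the following is a minimal sketch using Triton as an illustrative JIT kernel framework; the specific kernel (a vector add), the configuration grid, and the tuning key are assumptions for illustration and are not taken from the paper.

```python
import torch
import triton
import triton.language as tl

# Minimal sketch: the autotuner benchmarks each candidate configuration the
# first time the JIT-compiled kernel is launched for a given problem size,
# then caches and reuses the fastest one. The configuration grid below is a
# hypothetical example of a kernel-parameter search space.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": bs}, num_warps=w)
        for bs in (128, 256, 512, 1024)
        for w in (2, 4, 8)
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # The grid is computed from the block size chosen by the autotuner.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements)
    return out
```

Because the kernel is compiled at runtime for the target device, the same source can be tuned and executed on GPUs from different vendors without hand-written, platform-specific code.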