Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper remains with its authors and their institutions; when sharing, please cite the source.

Experience Deploying Containerized GenAI Services at an HPC Center

Created by
  • Haebom

Authors

Angel M. Beltre, Jeff Ogden, Kevin Pedretti

Outline

This paper examines how the components used to build generative artificial intelligence (GenAI) applications, including inference servers, object stores, vector and graph databases, and user interfaces, are interconnected through web-based APIs. It highlights the growing trend of deploying these components as containers in cloud environments and argues that high-performance computing (HPC) centers need to develop comparable capabilities. The paper discusses integrating HPC and cloud computing environments and presents a converged computing architecture that combines HPC and Kubernetes platforms to run containerized GenAI workloads. A case study of deploying a Llama large language model (LLM) demonstrates running a containerized inference server (vLLM) with multiple container runtimes on both Kubernetes and HPC platforms. The paper concludes with practical considerations and opportunities for the HPC container community, along with guidance for future research and tool development.
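Since these components communicate over web-based APIs, a client can reach the inference server with a plain HTTP request. The sketch below queries a vLLM server through its OpenAI-compatible endpoint; it is a minimal illustration, not code from the paper, and the endpoint URL and model name are assumptions made for the example.

    # Minimal sketch: querying a containerized vLLM inference server over its
    # OpenAI-compatible HTTP API. The endpoint URL and model name are
    # illustrative assumptions, not values taken from the paper.
    import requests

    VLLM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical endpoint

    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model name
        "messages": [
            {"role": "user", "content": "What does an inference server do?"}
        ],
        "max_tokens": 128,
    }

    # vLLM's OpenAI-compatible server accepts the same JSON schema as the
    # OpenAI Chat Completions API, so any OpenAI-style client works as well.
    response = requests.post(VLLM_URL, json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])

Because the API is OpenAI-compatible, the same client code works whether the server runs on a Kubernetes cluster or an HPC node, which is part of what makes the converged architecture practical.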

Takeaways, Limitations

Takeaways:
  • Shares practical experience deploying GenAI workloads at an HPC center, offering a practical guide to integrating HPC and cloud computing environments.
  • Proposes a converged computing architecture for running containerized GenAI workloads, contributing to reproducible research environments.
  • Demonstrates, through the Llama LLM deployment case study, the applicability of container technology on Kubernetes and HPC platforms (a deployment sketch follows this list).
  • Presents practical considerations and opportunities for the HPC container community and suggests directions for future research.
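As a concrete illustration of the Kubernetes half of the case study, the sketch below creates a Deployment for a vLLM server using the official kubernetes Python client. This is a hedged sketch under assumed values: the image tag, model name, and single-GPU request are placeholders for illustration, not the paper's actual configuration or runtimes.

    # Minimal sketch: deploying a containerized vLLM inference server on
    # Kubernetes with the official Python client. The image tag, model name,
    # and GPU request are illustrative assumptions, not the paper's settings.
    from kubernetes import client, config

    config.load_kube_config()  # authenticate using the local kubeconfig

    container = client.V1Container(
        name="vllm",
        image="vllm/vllm-openai:latest",  # hypothetical image tag
        args=["--model", "meta-llama/Llama-3.1-8B-Instruct"],  # hypothetical model
        ports=[client.V1ContainerPort(container_port=8000)],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"}  # one GPU per replica
        ),
    )

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="vllm-server"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": "vllm"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "vllm"}),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

On the HPC side, the paper describes running the same containerized server under HPC-oriented container runtimes rather than a Kubernetes Deployment; those details are not reproduced here.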
Limitations:
  • The case study covers a single LLM (Llama) and a specific set of container runtimes, which may limit how far the conclusions generalize.
  • Further detail on the architecture and implementation would be needed to apply the approach in practice, and real deployments may surface additional challenges.
  • Additional performance evaluation and optimization tailored to the specific characteristics of HPC environments are needed.