Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling

Created by
  • Haebom

Authors

Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang

Outline

This paper presents what the authors describe as the first framework to integrate scheduling with digital twins for high-performance computing (HPC), so that schedulers, which are critical to resource utilization, can be evaluated before deployment. Existing evaluation methods are limited to post-deployment analysis or to simulators that do not model the associated infrastructure. The framework supports what-if analysis of how parameter configurations and scheduling decisions affect physical assets such as power and cooling, both before deployment and for changes that are not easily realized in a production environment. Specifically, the work (1) provides a digital twin framework extended with scheduling capabilities, (2) integrates several top-tier HPC systems using their public datasets, (3) adds extensions for integrating external scheduling simulators, (4) implements and evaluates incentive structures, and (5) evaluates machine learning-based scheduling. Together, these enable what-if scenarios for assessing the sustainability of HPC systems and the impact of scheduling choices on the simulated system.
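The kind of what-if question described above can be illustrated with a toy example. The sketch below is not the paper's framework: it is a minimal, time-stepped simulation written for this summary, and every name in it (`Job`, `simulate`, `JOBS`, the node counts, and the wattage figures) is a hypothetical assumption. It runs the same three jobs under two queue orderings and reports makespan, peak power, and total energy, so the effect of a scheduling decision on power draw can be compared without touching a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int             # nodes requested
    runtime: int           # runtime in whole time steps
    watts_per_node: float  # assumed average draw per busy node


def simulate(jobs, total_nodes, idle_watts, order_key=None):
    """Toy what-if run: all jobs are submitted at t=0 and started strictly
    in queue order (no backfilling).

    Returns (makespan, peak_watts, energy_in_watt_steps).
    """
    if any(job.nodes > total_nodes for job in jobs):
        raise ValueError("a job requests more nodes than the system has")

    # order_key=None keeps submission order (first come, first served).
    queue = sorted(jobs, key=order_key) if order_key else list(jobs)
    running = []  # list of (end_time, job)
    t, peak, energy = 0, 0.0, 0.0

    while queue or running:
        # Release the nodes of jobs that have finished by time t.
        running = [(end, job) for end, job in running if end > t]
        free = total_nodes - sum(job.nodes for _, job in running)

        # Start jobs in order; stop at the first one that does not fit.
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            running.append((t + job.runtime, job))
            free -= job.nodes

        if not running:
            break  # nothing left to run

        # Power this step: idle nodes plus busy nodes.
        power = free * idle_watts + sum(
            job.nodes * job.watts_per_node for _, job in running
        )
        peak = max(peak, power)
        energy += power
        t += 1

    return t, peak, energy


# Hypothetical workload on a hypothetical 4-node system.
JOBS = [
    Job("A", nodes=2, runtime=4, watts_per_node=500.0),
    Job("B", nodes=4, runtime=1, watts_per_node=300.0),
    Job("C", nodes=2, runtime=4, watts_per_node=500.0),
]

if __name__ == "__main__":
    fcfs = simulate(JOBS, total_nodes=4, idle_watts=100.0)
    small_first = simulate(
        JOBS, total_nodes=4, idle_watts=100.0, order_key=lambda j: j.nodes
    )
    print("FCFS (makespan, peak W, energy):", fcfs)
    print("Fewest-nodes-first (makespan, peak W, energy):", small_first)
```

With these made-up numbers, starting the two small jobs together shortens the makespan and lowers total energy but raises peak power, which is the sort of trade-off a digital twin would let operators examine, with far more realistic power and cooling models, before changing a production scheduler.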

Takeaways, Limitations

Takeaways:
A new paradigm for HPC scheduler evaluation: pre-deployment evaluation based on digital twins.
Provides an integrated evaluation environment for various HPC systems and scheduling techniques.
Supports evaluation and prototyping of incentive structures and machine learning-based scheduling.
Enables what-if analysis of sustainability and system impact.
Limitations:
The accuracy and realism of the digital twins still need to be verified.
The limited availability of public datasets restricts the systems to which the framework can be applied.
Complex HPC systems are difficult to model accurately.
Further research is needed on the scalability and maintainability of the proposed framework.