Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling

Created by
  • Haebom

Authors

Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang

Outline

This paper presents a framework that integrates scheduling with digital twins to evaluate schedulers, which are critical to resource utilization in high-performance computing (HPC). Existing evaluation methods are limited to post-deployment analysis or to simulators that do not model the surrounding infrastructure. By contrast, the framework supports what-if studies that show how parameter configurations and scheduling decisions affect physical assets, such as power and cooling, before deployment. The main contributions are a digital twin framework extended with scheduling capabilities, the integration of several HPC systems through their public datasets, extensions for coupling external scheduling simulators, and the evaluation of incentive structures and machine learning-based scheduling. Together, these enable what-if scenarios for assessing the sustainability of HPC systems and the effects of scheduling choices on the simulated system.
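To make the idea of a scheduling what-if study concrete, here is a minimal Python sketch. It is not the paper's framework or API. It replays the same made-up job list under two queue policies on a toy cluster and reports makespan, mean wait, energy, and peak power. The `Job` class, the `simulate` function, the per-node power figures `IDLE_W` and `BUSY_W`, and the job list are all hypothetical, and the power model (a constant draw per idle or busy node, no cooling) is a deliberate simplification.

```python
import heapq
from dataclasses import dataclass

# Hypothetical per-node power draw in watts (assumed, not from the paper).
IDLE_W = 100
BUSY_W = 400


@dataclass(frozen=True)
class Job:
    name: str
    submit: int   # submission time in seconds
    nodes: int    # number of nodes requested
    runtime: int  # run time in seconds (assumed > 0)


def simulate(jobs, total_nodes, policy):
    """Replay jobs on a toy cluster; `policy` is a sort key for the wait queue."""
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the cluster has")
    pending = sorted(jobs, key=lambda j: j.submit)
    queue, running = [], []  # running is a heap of (end_time, name, nodes)
    free, now, energy_j, peak_w = total_nodes, 0, 0, 0
    waits = {}
    while pending or queue or running:
        # Move every job that has been submitted by `now` into the wait queue.
        while pending and pending[0].submit <= now:
            queue.append(pending.pop(0))
        # Start jobs in policy order; stop at the first that does not fit
        # (no backfilling in this sketch).
        queue.sort(key=policy)
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            free -= job.nodes
            waits[job.name] = now - job.submit
            heapq.heappush(running, (now + job.runtime, job.name, job.nodes))
        # Find the next event: a job finishing or a job being submitted.
        events = [running[0][0]] if running else []
        if pending:
            events.append(pending[0].submit)
        if not events:
            break
        nxt = min(events)
        # Integrate the constant power draw over the interval [now, nxt).
        power_w = (total_nodes - free) * BUSY_W + free * IDLE_W
        peak_w = max(peak_w, power_w)
        energy_j += power_w * (nxt - now)
        now = nxt
        # Release the nodes of every job that has finished by `now`.
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[2]
    return {
        "makespan_s": now,
        "mean_wait_s": sum(waits.values()) / len(waits) if waits else 0.0,
        "energy_j": energy_j,
        "peak_w": peak_w,
    }


# Made-up workload on a hypothetical 4-node cluster.
JOBS = [Job("A", 0, 2, 100), Job("B", 0, 4, 50), Job("C", 0, 2, 100)]
fcfs = simulate(JOBS, 4, policy=lambda j: j.submit)               # first come, first served
sjf = simulate(JOBS, 4, policy=lambda j: (j.runtime, j.submit))   # shortest job first
print("FCFS:", fcfs)
print("SJF: ", sjf)
```

With this toy workload, first come, first served finishes at 250 s and uses 280 kJ, while shortest job first finishes at 150 s and uses 240 kJ, because nodes spend less time idle. A digital twin replaces the constant-power model here with a model of the actual power and cooling infrastructure.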

Takeaways, Limitations

Takeaways:
A novel digital twin-based meta-framework for evaluating HPC schedulers.
Scheduling strategies and parameters can be tuned through pre-deployment what-if analysis.
The framework is extensible through integration with various HPC systems and external simulators.
Incentive structures and machine learning-based scheduling can be evaluated.
The sustainability of HPC systems can be assessed.
Limitations:
Further research is needed to validate the framework's performance and applicability in real-world HPC environments.
The accuracy and reliability of the digital twin models need to be reviewed.
The limitations and potential biases of the public datasets used should be taken into account.
Generalizability across scheduling algorithms and systems remains to be examined.