This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling
Created by
Haebom
Authors
Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang
Outline
This paper presents the first framework to integrate scheduling with digital twins for high-performance computing (HPC), where schedulers are critical to resource utilization. Traditional scheduler evaluation relies on post-deployment analysis or on simulators that do not model the associated infrastructure. The framework instead supports what-if studies of how parameter configurations and scheduling decisions affect physical assets before deployment, and of changes that are not easily realized in production environments. Specifically, the authors provide a digital twin framework extended with scheduling capabilities, integrate several top-tier HPC systems using their publicly available datasets, implement extensions for integrating external scheduling simulators, implement and evaluate incentive structures, and evaluate machine learning-based scheduling. The result enables what-if scenarios for assessing the sustainability of HPC systems and the impact on the simulated system.
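To make the idea of a scheduling what-if study concrete, the following is a minimal sketch and not the paper's framework. It assumes a toy 4-node cluster, three made-up jobs, a hypothetical constant per-node power model (`IDLE_WATTS`, `watts_per_node`), and strict queue ordering without backfilling. All names in it (`Job`, `simulate`, the two policies) are invented for illustration. It replays the same workload under two queue orderings and compares makespan, energy, and peak power.

```python
from dataclasses import dataclass
from itertools import count
import heapq

IDLE_WATTS = 100.0  # assumed idle power per node (illustrative value)


@dataclass(frozen=True)
class Job:
    name: str
    submit: int            # submit time in seconds
    nodes: int             # nodes requested
    runtime: int           # run time in seconds
    watts_per_node: float  # assumed constant draw while running


def simulate(jobs, total_nodes, policy):
    """Replay jobs under a queue-ordering policy (a sort key on Job).

    The head of the sorted queue blocks everything behind it (no backfilling).
    Returns makespan, total energy, and peak power for the toy cluster.
    """
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the cluster has")
    pending = sorted(jobs, key=lambda j: j.submit)
    queue, running = [], []  # running is a heap of (end_time, seq, job)
    seq = count()            # tie-breaker so Job objects are never compared
    free, t, i = total_nodes, 0, 0
    energy_j, peak_w = 0.0, 0.0
    while i < len(pending) or queue or running:
        # Admit every job that has been submitted by time t.
        while i < len(pending) and pending[i].submit <= t:
            queue.append(pending[i])
            i += 1
        # Start jobs in policy order until the head no longer fits.
        queue.sort(key=policy)
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            free -= job.nodes
            heapq.heappush(running, (t + job.runtime, next(seq), job))
        # Power is constant until the next event (job end or job arrival).
        power_w = free * IDLE_WATTS + sum(
            j.nodes * j.watts_per_node for _, _, j in running
        )
        peak_w = max(peak_w, power_w)
        events = [running[0][0]] if running else []
        if i < len(pending):
            events.append(pending[i].submit)
        t_next = min(events)
        energy_j += power_w * (t_next - t)
        t = t_next
        # Release the nodes of every job that has finished by time t.
        while running and running[0][0] <= t:
            _, _, done = heapq.heappop(running)
            free += done.nodes
    return {"makespan_s": t, "energy_j": energy_j, "peak_w": peak_w}


# Made-up workload: three jobs submitted at t=0 on a 4-node cluster.
workload = [
    Job("A", submit=0, nodes=3, runtime=100, watts_per_node=300.0),
    Job("B", submit=0, nodes=2, runtime=50, watts_per_node=300.0),
    Job("C", submit=0, nodes=1, runtime=100, watts_per_node=300.0),
]

fcfs = simulate(workload, total_nodes=4, policy=lambda j: j.submit)
smallest_first = simulate(workload, total_nodes=4, policy=lambda j: j.nodes)

print("FCFS:          ", fcfs)
print("Smallest-first:", smallest_first)
```

On this toy workload, smallest-first finishes sooner (150 s versus 200 s) and uses less energy (160 kJ versus 180 kJ), but raises peak power from 1.0 kW to 1.2 kW. A digital twin is meant to expose this kind of trade-off with realistic power and cooling models rather than the constants assumed here.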
Takeaways, Limitations
• Takeaways:
◦ A new paradigm for HPC scheduler evaluation: pre-deployment evaluation based on digital twins.
◦ Provides an integrated evaluation environment for various HPC systems and scheduling techniques.
◦ Supports evaluation and prototyping of incentive structures and machine learning-based scheduling.
◦ Enables what-if analysis of sustainability and system impacts.
• Limitations:
◦ The accuracy and realism of the digital twins still need to be verified.
◦ The range of applicable systems is constrained by the limited availability of public datasets.
◦ Complex HPC systems are difficult to model accurately.
◦ Further research is needed on the scalability and maintainability of the proposed framework.