Holistic Measurement-Driven System Assessment

Abstract

In high-performance computing systems, application performance and throughput depend on a complex interplay of hardware and software subsystems and on variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagation of the impact of events, such as faults and contention for shared resources, can drive more effective use of resources, improve root cause diagnosis, and help predict performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize the propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.

Publication
HPCMASPA 2017, CLUSTER 2017
Date