LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications

Abstract

This paper presents LogDiver, a tool for the analysis of application-level resiliency in extreme-scale computing systems. The tool has been implemented to handle data generated by system monitoring tools in Blue Waters, the petascale machine in production at the University of Illinois’ National Center for Supercomputing Applications. The tool can (i) filter, extract, and classify error data from different sources of information, such as system logs, hardware sensors, and workload logs; (ii) extract signals from the categorized errors; (iii) consolidate user application data and decode application and job exit status, highlighting the reasons for the application or job exit; and (iv) correlate application failures with errors using a mix of empirical and analytical techniques. To the best of our knowledge, this is the first tool capable of measuring application-level resiliency in extreme-scale machines. We also demonstrate the power of the tool by showing that XK applications are more vulnerable to failures than XE applications.
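The correlation step (iv) can be illustrated with a minimal sketch. This is not LogDiver's implementation: the record types `ErrorEvent` and `AppRun`, their fields, the `correlate` function, and the 60-second window are all hypothetical names and values chosen for illustration. The sketch assumes a failed run is one with a nonzero exit code and attributes to it any classified error that occurred on one of its nodes between its start time and shortly after its end time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """A classified error record (hypothetical schema)."""
    timestamp: float  # seconds since some epoch
    node: str         # hypothetical node identifier
    category: str     # e.g. "memory", "gpu", "interconnect"


@dataclass(frozen=True)
class AppRun:
    """A consolidated application run (hypothetical schema)."""
    app_id: str
    start: float
    end: float
    nodes: frozenset
    exit_code: int    # nonzero is treated as a failure in this sketch


def correlate(runs, errors, window=60.0):
    """Map each failed run to errors seen on its nodes during the run
    or within `window` seconds after it ended (assumed heuristic)."""
    matches = {}
    for run in runs:
        if run.exit_code == 0:
            continue  # successful runs are not correlated
        matches[run.app_id] = [
            e for e in errors
            if e.node in run.nodes
            and run.start <= e.timestamp <= run.end + window
        ]
    return matches


# Made-up records, purely for illustration.
runs = [
    AppRun("app-1", 0.0, 100.0, frozenset({"n1", "n2"}), 1),
    AppRun("app-2", 0.0, 100.0, frozenset({"n3"}), 0),
    AppRun("app-3", 200.0, 300.0, frozenset({"n1"}), 137),
]
errors = [
    ErrorEvent(95.0, "n2", "memory"),
    ErrorEvent(96.0, "n3", "gpu"),
    ErrorEvent(500.0, "n1", "interconnect"),
]
result = correlate(runs, errors)
```

In this toy data, the failed run `app-1` is matched to the memory error on one of its nodes, while `app-3` fails with no nearby error and is left with an empty match list. A purely temporal and spatial join like this is only the empirical half of the approach the abstract describes; it does not attempt the analytical techniques.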

Publication
FTXS (colocated with HPDC) 2015