Saurabh Jha

PhD Candidate - Computer Science Department
Graduate Research Assistant - DEPEND@CSL
University of Illinois at Urbana-Champaign

I am a graduate research assistant in the DEPEND Group at the Coordinated Science Laboratory (CSL), pursuing a PhD in computer science at the University of Illinois at Urbana-Champaign. My research interests include the design of fault tolerance and recovery methods for large-scale systems (HPC and Cloud) and data analytics frameworks to support resiliency studies. Currently, I am collaborating with NCSA, SNL, LANL, NERSC and Cray on the "Holistic, Measurement-Driven Resilience" project. More details of my research activities in the DEPEND group can be found on the subgroup page, "Resiliency For eXtreme Scale Systems".



University of Illinois at Urbana-Champaign

MS in Computer Science

Research Assistant in the DEPEND Group @ CSL. My research advisors are Prof. Dr. Ravishankar Iyer and Prof. Dr. Zbigniew T. Kalbarczyk.


NTU Singapore

International Exchange Student

Vellore Institute of Technology, India (2010 - 2014)

B. Tech. in Computer Sciences & Engineering (Emphasis in Parallel Programming and Heterogeneous Systems), May 2014

  • Best Outgoing Student Award

Work Experience

Research Intern:

Distributed and Operating System Group, Microsoft Research (June 2015 - August 2015)

Developed methods to automate the process of finding issues leading to virtual machine failures.

Research Assistant:

DEPEND Group, Coordinated Science Laboratory, UIUC (Current)

Researching fault tolerance and reliability of large-scale systems.

Research Assistant:

Xtra Group, PDCC, NTU Singapore (Jan - July 2014)

Completed research on in-memory hash joins for accelerators.

Teaching Assistant:

Multicore System Programming Course, VIT University (Sep - Nov 2013)

Teaching Assistant for the Multicore System Programming course. Conducted extra tutorial sessions for a batch of 30 junior-year students.

Teaching Assistant:

Artificial Intelligence Course, VIT University (Feb - May 2013)

Teaching Assistant for the Artificial Intelligence course. Conducted extra tutorial sessions for a batch of 60 sophomore students.


May 2016

Understanding Gemini Interconnect Failovers on Blue Waters. Jha, S., Formicola, V., Di Martino, C., Kalbarczyk, Z., Kramer, W. and Iyer, R. In Proc. Cray User Group (CUG), London, England, April 2016 (to appear) [Link]

Feb 2016

Resiliency for eXtreme Scale Systems. Jha, S. CSL Student Conference, University of Illinois (Best Poster Award) [Link]

JUL 2015

Saurabh Jha, LASE: Log Analysis and Storage Engine for Resiliency Study, NSF funded Data Science Workshop 2015, University of Washington [Link]

MAR 2015

Catello Di Martino, Saurabh Jha, Zbigniew Kalbarczyk, William Kramer, Ravishankar K. Iyer, LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications, Fault Tolerance for HPC at eXtreme Scale (FTXS) 2015, HPDC 2015 [Link]

OCT 2014

Saurabh Jha, Mian Lu, Cheng Xuntao, Bingsheng He, Huynh Phung Huynh, Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach, VLDB 2015 [Link]

Jan 2014

Saurabh Jha, Vijay Menon, BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem, High Performance Computing Symposium (HPC ’14), Spring Simulation Multi-conference 2014, Tampa, FL [Link]

AUG 2013

Saurabh Jha, Priyank Trivedi, An Automated Video Surveillance System Using Viewpoint Feature Histogram and CUDA-enabled GPUs, in Proceedings of the Second International Symposium on Pattern Recognition and Image Processing, co-located with the IEEE ICACCI, Mysore, India [PDF] [Link]

JUN 2013

Tejaswi Agarwal, Saurabh Jha, B Rajesh Kanna, P-HGRMS: A Parallel Hypergraph based Root Mean Square Algorithm for Image Denoising, poster presented at the 22nd ACM International Symposium on High Performance Parallel and Distributed Computing, New York, USA, HPDC 2013 (Best Poster Award) [PDF] [Extended Abstract] [Poster]

JUN 2013

Saurabh Jha, Tejaswi Agarwal, B Rajesh Kanna, Exploiting Data Parallelism in the yConvex Hypergraph Algorithm for Image Representation using GPGPUs, poster presented at the 27th IEEE/ACM International Conference on Supercomputing, Eugene, Oregon, USA, ICS 2013 [PDF] [Extended Abstract] [Poster]


OCT 2015 - March 2016

Understanding Gemini Interconnect Failures in Cray HPC systems

In this work, we designed tools to model and study failure propagation in Gemini interconnect systems. Failure propagation is modeled using a topology-aware, state-machine-based coalescing algorithm. This analysis allows us to assess the recovery capabilities of Gemini and to focus on the cases where the failover mechanisms fail to recover, resulting in partial or total interruption of system services (system-wide outages). It is a first large-scale study that models and quantifies interconnect failures in HPC systems.

DEC 2014 - FEB 2015

LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications

In this work, we built LogDiver, a tool capable of analyzing system logs, job logs and manual failure reports. The tool is able to: i) filter, extract, and classify error data from different sources of information, such as system logs, hardware sensors and workload logs; ii) extract signals from the categorized errors; iii) consolidate user application data and decode application and job exit status, highlighting the reasons for the application/job exit; and iv) correlate application failures with errors using a mix of empirical and analytical techniques.

Jan - June 2014

Hash Joins on Xeon Phi

Guide: Dr. Bingsheng He, NTU Singapore and Dr. Mian Lu, IHPC, A*STAR, Singapore
In this work, we implemented hash join algorithms on Xeon Phi and experimentally compared the performance of cache-conscious and cache-oblivious implementations of hash join on Xeon Phi. We also compared the trends of hash joins over a wide range of parameters on Intel Xeon Phi and Intel Xeon. More information and code snippets can be found at the project web page.

AUG - OCT 2013

Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem

In this work, we proposed a GPU-based parallel simulated annealing algorithm for the mirrored Traveling Tournament Problem (mTTP) and tested the available instances on NVIDIA CUDA devices. We also introduced a new mTTP instance, IPL-09, modelled on the Indian Premier League, one of India's most popular sports leagues. Applying the proposed algorithm to this instance, we were able to reduce the total distance traveled by 30.20%, or roughly 37,000 kilometers. The proposed algorithm converged faster to the best known solutions, with a significant speedup of 50-80x in terms of the number of solutions explored per second.

FEB - MAY 2013

Exploiting data parallelism using yConvex Hypergraph (yCHG) algorithm for image processing using GPGPU

Guide: Dr. Rajesh Kanna
The aim of the project was to parallelize and evaluate the performance of hypergraph-based image processing algorithms, such as image denoising and image representation, on CUDA devices. The overall performance improvement was 10-20x over CPUs.
This work was presented at ICS 2013 in Eugene, Oregon and HPDC 2013, NY, USA.


  • Won best poster award at CSL Student Conference 2016
  • Several travel awards - UIUC College of Engineering travel grants, VLDB 2015
  • Received Blue Waters exploratory grant [Co-PI]
  • Best Outgoing Student Award, 2010 - 2014 batch, VIT Chennai
  • Won best poster award at High-Performance Parallel and Distributed Computing 2013 Details
  • Recognized by the university for open-source software development and promotion


  • 18/02/2017: "Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo", technical paper in CUG 2017
  • "Resiliency of HPC Interconnects: A case study of interconnect failures and recovery in Blue Waters", submitted, under revision
  • 19/02/2016: "Understanding Interconnect Failovers", technical paper accepted in Cray User Group 2016
  • 18/02/2016: Won Best Poster Award at the CSL Student Conference for the poster titled "Resiliency For eXtreme Scale Systems"
  • 26/11/2016: Blue Waters Exploratory grant proposal accepted
  • 09/07/2015: Invited to attend the NSF Data Science Workshop at the University of Washington based on an abstract submitted for the NSF Grand Challenge
  • 16/03/2015: LogDiver paper accepted in FTXS, HPDC 2015
  • 14/12/2014: Hash join paper for Xeon Phi accepted, VLDB 2015
  • 12/01/2014: Research paper titled "BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem" at HPC 2014, part of the SpringSim Multiconference, Tampa, FL
  • 12/2013: Selected as Research Scholar and Intl. Student Exchange Student at NTU Singapore
  • 25/07/2013: Research Paper accepted at SRS 2013 co-located with ICACCI 2013
  • 21/06/2013: Won best poster award at HPDC 2013 held in New York, USA
  • 30/05/2013: Research Paper accepted at ICACCI 2013
  • 20/05/2013: Research Poster accepted at HPDC 2013
  • 20/04/2013: Research Poster accepted at ICS 2013

Contact Info & Social Networks

Interested? Drop me a message.