Institute for Advanced Simulation (IAS)

Fault-Tolerant Parallel-in-Time Integration (FaT-PinT)

Resilience is one of the major topics in modern high-performance computing (HPC) research. With millions of processors, the “mean time between failures” becomes a relevant issue for simulation scientists around the world. Incorporating countermeasures on the hardware side is expensive, difficult, time-consuming, or all three at once. Much attention has therefore been paid to “algorithm-based fault tolerance” strategies, which exploit specific features of numerical methods to continue working even after a processor crashes or a bit flips. While previous research mostly targeted low-level operations such as matrix-vector multiplication, this project aims to use novel methods from the field of parallel-in-time integration to detect and correct these faults.

Parallel-in-time integration methods for time-dependent partial differential equations make it possible to integrate multiple time steps simultaneously. So far they have mainly been considered as a means to extend the scaling limits of spatial parallelization and/or to improve the utilization of very large machines. However, many of these algorithms share features that make them natural candidates for algorithm-based fault tolerance, allowing them to continue integrating forward in time even when nodes fail or bits flip. With this novel combination, parallel-in-time integrators could help applications become resilient against faults while also potentially increasing their degree of parallelism.

This project will be supported by experts from the University of Leeds and embedded into the Joint Laboratory on Extreme Scale Computing (JLESC). As one of six JLESC members, JSC supports its scientists and PhD students in attending the frequent workshops and in carrying out short research visits. While the group in Leeds supports this project with its extensive expertise in the field of parallel-in-time integration, the connection to the Joint Lab serves as a door opener to the resilience community.


Forschungszentrum Jülich GmbH, Germany
The contact person is Robert Speck

Joint Laboratory on Extreme Scale Computing (JLESC)

Daniel Ruprecht, School of Mechanical Engineering, University of Leeds, UK

This project is funded by Forschungszentrum Jülich GmbH.

The grant period is April 2018 until March 2021.