Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications
Abstract
Future high-performance computing systems may face frequent failures as they rapidly grow in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, and prevalent MPI applications need fault tolerance support. Among failure scenarios, process failures are among the most severe because they usually terminate the application. However, the widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that supports failure detection, failure notification, and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
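The try/catch idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use MPI or the FTA-MPI interface (which this page does not describe), and the names `ProcessFailure`, `exchange`, and `run` are hypothetical. It shows a failure being raised inside a protected region, localized to one step, and handled by retrying that step with the surviving processes.

```python
class ProcessFailure(Exception):
    """Hypothetical exception carrying the rank of a failed process."""

    def __init__(self, rank):
        super().__init__(f"process {rank} failed")
        self.rank = rank


def exchange(step, alive, fail_at):
    """Stand-in for a communication call (not a real MPI routine).

    Raises ProcessFailure when a live rank is scheduled to die at this step;
    otherwise returns the number of participating ranks.
    """
    for rank, die_step in fail_at.items():
        if rank in alive and die_step == step:
            alive.discard(rank)  # the failed rank leaves the group
            raise ProcessFailure(rank)
    return len(alive)


def run(num_ranks, num_steps, fail_at):
    """Run num_steps simulated steps, recovering from injected failures."""
    alive = set(range(num_ranks))
    log = []
    step = 0
    while step < num_steps:
        try:
            # Protected region: a failure here is localized to this step.
            participants = exchange(step, alive, fail_at)
            log.append((step, participants))
            step += 1
        except ProcessFailure as failure:
            # Recovery handler: note the failure, then retry the same step
            # with the survivors (an assumed policy, chosen for illustration).
            log.append((step, f"recovered from rank {failure.rank}"))
    return log


if __name__ == "__main__":
    for entry in run(num_ranks=4, num_steps=3, fail_at={2: 1}):
        print(entry)
```

Retrying with the survivors is only one possible recovery policy; a real application might instead respawn the failed process or roll back to a checkpoint.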
Similar resources
Fault Tolerance Assistant (FTA): An Exception Handling
We propose FTA, a programming model that provides failure localization and transparent recovery of process failures in MPI applications.
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum’s Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This pap...
Implementing Coordinated Exception Handling for Distributed Object-Oriented Systems with AspectJ
Exception handling is a very popular technique for incorporating fault tolerance into software systems. However, its use for structuring concurrent, distributed systems is hindered by the fact that the exception handling models of many mainstream object-oriented programming languages are sequential. In this paper we present an aspect-based framework for incorporating concurrent exception handli...
Programming Secure and Robust Pervasive Computing Applications
We have developed a programming framework for building context-aware multi-user collaborative applications in pervasive computing environments. It supports context-sensitive security and multi-user coordination requirements. It also supports error handling in pervasive computing applications through an exception handling model. In this paper we present the programming framework and demonstrate ...
Implementing Coordinated Error Recovery for Distributed Object-Oriented Systems with AspectJ
Exception handling is a very popular technique for incorporating fault tolerance into software systems. However, its use for structuring concurrent, distributed systems is hindered by the fact that the exception handling models of many mainstream object-oriented programming languages are sequential. In this paper we present an aspect-based framework for incorporating concurrent exception handli...