Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid
Authors
Abstract
Application robustness becomes a major concern with the continued scaling of high performance computing (HPC). In a recent study [8], we developed an adaptive fault management scheme called FT-Pro that improves application robustness by combining the merits of proactive process migration and reactive checkpointing. In this paper, we extend that study by integrating FT-Pro with a production-level MPI package and investigating its effectiveness across a number of real-world parallel applications. Extensive experiments are conducted on an IA32 cluster at TeraGrid/ANL, comparing FT-Pro against periodic checkpointing under a wide range of system parameters and failure behaviors. These preliminary experiments show the potential of using adaptive fault tolerance to improve application performance in the presence of failures.
Similar resources
A Robust Adaptive Observer-Based Time Varying Fault Estimation
This paper presents a new observer design methodology for time-varying actuator fault estimation. A new linear matrix inequality (LMI) design algorithm is developed to tackle the limitations (e.g. equality constraint and robustness problems) of the well-known fast adaptive fault estimation (FAFE) observer. The FAFE is capable of estimating a wide range of time-varying actuator fault...
Threat Detection in an Urban Water Distribution Systems with Simulations Conducted in Grids and Clouds
We present a workflow-based algorithm for identifying threats to an urban water management system. Through Grid computing we provide the high-performance computing resources needed to deliver solutions to the problem quickly. We prototyped a new middleware called cyberaide that enables easy access to Grid resources through portals or the command line. A workflow system is used to manage res...
CAFT: Cost-aware and Fault-tolerant routing algorithm in 2D mesh Network-on-Chip
The increasing complexity of chips and the need to integrate more components into a chip have made network-on-chip an important infrastructure for communication within the system and a good alternative to traditional bus-based approaches. As chip density increases, the possibility of failure in the on-chip network increases, and providing correction and fault tol...
Peaking Attenuation in High-Gain Observers Using Adaptive Saturation: Application to a Ball and Wheel System
Despite providing robustness, high-gain observers introduce a peaking phenomenon in the system states, which may cause instability. In this paper, an adaptive saturation is proposed to attenuate this undesirable phenomenon in high-gain observers. A real-valued and differentiable sigmoid function is considered as the saturating element whose parameters (height and slope) are adaptively tu...
Novel Defect Terminolgy Beside Evaluation And Design Fault Tolerant Logic Gates In Quantum-Dot Cellular Automata
Quantum-dot Cellular Automata (QCA) is one of the important nano-level technologies for implementing both combinational and sequential systems. QCA has the potential to achieve low power dissipation and to operate at high speed, at THz frequencies. However, the high probability of fabrication defects in QCA is a fundamental challenge to using this emerging technology. Because of these vari...