Building and utilizing fault tolerance support tools for the GASPI applications
نویسندگان
چکیده
Today’s High Performance Computing (HPC) systems are made possible by multiple increase in hardware parallelity. This results in the decrease of mean time to failures (MTTF) of the systems with every newer generation, which is an alarming trend. Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. We have used GASPI, which is a relatively new communication library based on the PGAS model. It fulfills the basic requirement of a fault tolerant communication library , i.e., failure of a process do not cause the remaining processes to fail. This work is focused to extend the fault tolerance features of GASPI in form of a supporting health-check (HC) library that applications can benefit from. These features include failure detection, its information propagation, recovery management, communication recovery, etc. To reinforce its utility, we have also developed a fault tolerant neighbor node-level checkpoint/restart (CR) library. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how (using these supplementary fault tolerance functions) one can build applications to allow integrate a low cost fault detection/recovery mechanism and, if necessary, recover the application on the fly. We showcase the usage of these tools by implementing them in three different applications. Two of the applications fall in the category of linear sparse solvers, whereas the third application is based on a fluid flow solver. We also analyze the overheads involved in failure-free cases as well as various failure cases. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of reasonably acceptable order and shows good scalability.
منابع مشابه
Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs
Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...
متن کاملA Runtime Environment for Supporting Research in Resilient HPC System Software & Tools
The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The runtime environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in parallel on the machine. The deployment of app...
متن کاملارائه یک رویکرد همانند سازی شده عامل محور در اجرای یک الگوی کد متحرک مطمئن
Abstract Using mobile agents, it is possible to bring the code close to the resources, which is not foreseen by the traditional client/server paradigm. Compared to the client/server computing paradigm, the greater flexibility of the mobile agent paradigm comes at additional costs as well as the additional complexity of developing and managing mobile agent-based applications. Such complexity ...
متن کاملA Tool for Constructing Service Replication Systems
Service replication is a key to providing high availability, fault tolerance and good performance in distributed systems. However, building a service replication system is a di cult and complex task. This paper describes a tool that mimics the design of the remote procedure call (RPC) system to support building distributed service replication systems. The tool includes an interface de nition la...
متن کاملImproving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
متن کامل