Building and utilizing fault tolerance support tools for the GASPI applications

نویسندگان

Faisal Shahzad

Moritz Kreutzer

Thomas Zeiser

Rui Machado

Andreas Pieper

Georg Hager

Gerhard Wellein

چکیده

Today’s High Performance Computing (HPC) systems are made possible by multiple increase in hardware parallelity. This results in the decrease of mean time to failures (MTTF) of the systems with every newer generation, which is an alarming trend. Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. We have used GASPI, which is a relatively new communication library based on the PGAS model. It fulfills the basic requirement of a fault tolerant communication library , i.e., failure of a process do not cause the remaining processes to fail. This work is focused to extend the fault tolerance features of GASPI in form of a supporting health-check (HC) library that applications can benefit from. These features include failure detection, its information propagation, recovery management, communication recovery, etc. To reinforce its utility, we have also developed a fault tolerant neighbor node-level checkpoint/restart (CR) library. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how (using these supplementary fault tolerance functions) one can build applications to allow integrate a low cost fault detection/recovery mechanism and, if necessary, recover the application on the fly. We showcase the usage of these tools by implementing them in three different applications. Two of the applications fall in the category of linear sparse solvers, whereas the third application is based on a fluid flow solver. We also analyze the overheads involved in failure-free cases as well as various failure cases. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of reasonably acceptable order and shows good scalability.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...

متن کامل

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The runtime environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in parallel on the machine. The deployment of app...

متن کامل

ارائه یک رویکرد همانند سازی شده عامل محور در اجرای یک الگوی کد متحرک مطمئن

Abstract Using mobile agents, it is possible to bring the code close to the resources, which is not foreseen by the traditional client/server paradigm. Compared to the client/server computing paradigm, the greater flexibility of the mobile agent paradigm comes at additional costs as well as the additional complexity of developing and managing mobile agent-based applications. Such complexity ...

متن کامل

A Tool for Constructing Service Replication Systems

Service replication is a key to providing high availability, fault tolerance and good performance in distributed systems. However, building a service replication system is a di cult and complex task. This paper describes a tool that mimics the design of the remote procedure call (RPC) system to support building distributed service replication systems. The tool includes an interface de nition la...

متن کامل

Improving the palbimm scheduling algorithm for fault tolerance in cloud computing

Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

Building and utilizing fault tolerance support tools for the GASPI applications

نویسندگان

چکیده

منابع مشابه

Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

ارائه یک رویکرد همانند سازی شده عامل محور در اجرای یک الگوی کد متحرک مطمئن

A Tool for Constructing Service Replication Systems

Improving the palbimm scheduling algorithm for fault tolerance in cloud computing

عنوان ژورنال:

اشتراک گذاری