In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Authors
Abstract
Today, high-performance computing (HPC) systems are far larger than ever before, which poses a challenge for their fault tolerance. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI, most of which are based on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing/recovery algorithms that embed erasure-code encoding/decoding computation into MPI collective communication operations. Experiments show that the scalable algorithms decrease communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for
Similar resources
On the Speedup of Recovery in Large-Scale Erasure-Coded Storage Systems (Supplementary File)
Our work focuses on the recovery solutions for XOR-based erasure codes. We point out that regenerating codes [5] have recently been proposed to minimize the recovery bandwidth in distributed storage systems. The idea is that surviving storage nodes compute and transmit linear combinations of their stored data during failure recovery. On the other hand, in XOR-based erasure codes, we do not requi...
Asymptotically MDS Array BP-XOR Codes
Belief propagation, or message passing, on binary erasure channels (BEC) is a low-complexity decoding algorithm that allows the recovery of message symbols based on a bipartite graph pruning process. Recently, array XOR codes have attracted attention for storage systems due to their burst error recovery performance and easy arithmetic based on Exclusive OR (XOR)-only logic operations. Array BP-XOR...
Belief Propagation Decodable XOR based Erasure Codes For Distributed Storage Systems
LDPC codes and digital fountain techniques have received significant attention from both academia and industry in the past few years. There has also been extensive interest in applying LDPC code techniques to distributed storage systems such as cloud data storage in recent years. This paper carries out the theoretical analysis on the feasibility and performance issues for applying LT codes t...
New User-Guided and ckpt-Based Checkpointing Libraries for Parallel MPI Applications
We present design and implementation details as well as performance results for two new parallel checkpointing libraries that we developed for parallel MPI applications. The first one, a user-guided library, requires the programmer to provide packing and unpacking code through an easy-to-use API using MPI constants. It uses MPI-2 collective I/O calls or a dedicated master process for checkpointi...
High-fidelity reliability simulation of XOR-based erasure codes
Erasure codes are the means by which storage systems are typically made reliable. Recent high-profile studies of disk failures and sector failures indicate that ever more fault-tolerant erasure codes are needed. Many traditional RAID approaches, parity-check array codes (e.g., EVENODD, RDP, and X-code), and MDS codes offer two- and three-disk fault-tolerant schemes. There are also many novel erasu...