coordinated checkpointing

Communication-aware Approaches for Transparent Checkpointing in Cloud Computing

Journal: Scalable Computing: Practice and Experience, 2016

Samy Sadi Belabbes Yagoubi

Checkpoint/Restart, or checkpointing, is a fault tolerance technique that consists of taking frequent snapshots of an application so that, in the event of a failure, the application's state can be restored and its execution continued without necessarily restarting it. The advent of Cloud Computing brought new challenges with regard to this technique as Fault Tolerance needs to be ...


Coordinated Checkpointing using Vector Timestamp in Grid Computing

2006

Taichi Jinno Tokimasa Kamiya Motoyasu Nagata

In grid computing, system recovery is carried out using checkpoints recorded at each node. The resource manager must recover the system while maintaining global consistency to prevent the domino effect. Currently, coordinated checkpointing, in which all processes are synchronized, is widely used. Considering the overhead due to synchronization, we present a coordinated checkpoint protocol using vector ti...
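As a rough illustration of how vector timestamps can be used to test global consistency, the sketch below applies the textbook vector-clock condition for a consistent cut. It is not the protocol from this paper, whose details are truncated above; the function names `merge` and `is_consistent` are hypothetical.

```python
from typing import List


def merge(local: List[int], received: List[int], me: int) -> List[int]:
    """Vector-clock update on message receipt (hypothetical helper).

    Takes the element-wise maximum of the local and received clocks,
    then increments the receiving process's own entry.
    """
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[me] += 1
    return merged


def is_consistent(checkpoints: List[List[int]]) -> bool:
    """Check whether a set of local checkpoints forms a consistent global state.

    checkpoints[i] is the vector timestamp stored with process i's checkpoint.
    The set is consistent when no process j has recorded more events of
    process i than process i itself had executed at its own checkpoint;
    otherwise j's checkpoint reflects a message that i has not yet sent.
    """
    n = len(checkpoints)
    return all(
        checkpoints[j][i] <= checkpoints[i][i]
        for i in range(n)
        for j in range(n)
    )


# P1 has seen one event of P0, and P0's checkpoint covers two: consistent.
print(is_consistent([[2, 0], [1, 1]]))
# P1 has seen two events of P0, but P0's checkpoint covers only one: orphan message.
print(is_consistent([[1, 0], [2, 1]]))
```

A recovery line that fails this check would force further rollbacks, which is the domino effect the abstract refers to.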


Using Time to Improve the Performance of Coordinated Checkpointing

1996

Nuno Neves W. Kent Fuchs

This paper describes and evaluates a coordinated checkpoint protocol that uses time to eliminate several performance overheads that are present in traditional protocols. The time-based protocol does not have to exchange coordination messages, does not need to add information to the processes' messages, and only accesses stable storage when checkpoints are saved. This protocol uses a simple initia...


Protocol for Coordinated Checkpointing using Smart Interval with Dual Coordinator

2014

Manoj Kumar Niranjan Mahesh Motwani D. Manivannan R. H. B. Netzer Guohong Cao Mukesh Singhal E. N. Elnozahy D. B. Johnson Lorenzo Alvisi Yi-Min Wang David B. Johnson M. M. Naidu Sarmistha Neogy Anupam Sinha Pradip K Das J. Makhijani M. K. Niranjan M. Motwani A. K. Sachan A. Rajput

Introduction to Distributed System Design, Google Code University, http://code.google.com/edu/parallel/dsd-tutorial.html#Basics
D. Manivannan, R. H. B. Netzer & M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation", IEEE Trans. on Parallel & Distributed Systems, Vol. 8, No. 6, pp. 623-627 (June 1997)
J. Tsai & S. Kuo, "Theoretical Analysis for Commun...


A Minimum-Process Coordinated Checkpointing Protocol For Mobile Distributed System

2010

Praveen Kumar Ajay Khunteta

Mobile distributed systems raise issues such as mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power, and a high failure rate of mobile nodes. These issues make traditional checkpointing techniques designed for distributed systems unsuitable for mobile environments. In this paper, we design a ...


libhashckpt: Hash-Based Incremental Checkpointing Using GPU's

2011

Kurt B. Ferreira Rolf Riesen Ron Brightwell Patrick G. Bridges Dorian C. Arnold

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint str...
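The idea of hash-based incremental checkpointing named in the title can be sketched as follows: split the application state into fixed-size blocks, hash each block, and save only the blocks whose hash differs from the previous checkpoint. This is a minimal CPU-only sketch under assumed parameters, not the libhashckpt API and not its GPU implementation; `BLOCK_SIZE`, `block_hashes`, and `incremental_checkpoint` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4096  # hypothetical block size; the library's actual value is not given here


def block_hashes(memory: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per fixed-size block of the state."""
    return [
        hashlib.sha256(memory[off:off + block_size]).digest()
        for off in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(
    memory: bytes,
    previous_hashes: List[bytes],
    block_size: int = BLOCK_SIZE,
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Collect only the blocks that changed since the previous checkpoint.

    Returns a mapping from block index to block contents for the changed
    blocks, plus the new hash list to compare against next time.
    """
    current = block_hashes(memory, block_size)
    changed: Dict[int, bytes] = {}
    for idx, digest in enumerate(current):
        # A block is saved if it is new or its hash differs from last time.
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            off = idx * block_size
            changed[idx] = memory[off:off + block_size]
    return changed, current


state = bytearray(b"aaaabbbbccccdddd")
first, hashes = incremental_checkpoint(bytes(state), [], block_size=4)
state[8:12] = b"XXXX"  # modify only the third block
second, hashes = incremental_checkpoint(bytes(state), hashes, block_size=4)
print(sorted(first), sorted(second))
```

The first checkpoint saves every block because there are no earlier hashes to compare against; later checkpoints shrink to the blocks that were actually written.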


Distributed Recovery Units: An Approach for Hybrid and Adaptive Distributed Recovery

1993

Nitin H. Vaidya

Traditionally, distributed recovery schemes have been designed for systems consisting of multiple recovery units. Each recovery unit (RU) resides on a single processor and it can fail and recover as a whole. This report introduces the "distributed recovery unit (DRU)" abstraction as an approach for design of "hybrid" and "adaptive" recovery schemes for distributed systems. The distributed syste...


Checkpointing and Rollback of Wide-area Distributed Applications using Mobile Agents

2001

Jiannong Cao G. H. Chan Tharam S. Dillon Weijia Jia

We consider the problem of designing rollback error recovery algorithms for dynamic, wide area distributed systems like the Internet. The characteristics and the scale of such a system complicate the design and performance of the algorithms. Traditional message passing based algorithms incur large overhead, in both the network traffic and message passing delay, in such a wide-area environment. ...


The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Journal: IJHPCA, 2005

Sriram Sankaran Jeffrey M. Squyres Brian W. Barrett Vishal Sahay Andrew Lumsdaine Jason Duell Paul Hargrove Eric Roman

As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kern...


On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing

Journal: Future Generation Computer Systems, 2015
