coordinated checkpointing

Optimal Checkpointing Period: Time vs. Energy

2013

Guillaume Aupy Anne Benoit Thomas Hérault Yves Robert Jack J. Dongarra

This short paper deals with parallel scientific applications using non-blocking and periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs to be made by instantiating the model with a set of realistic ...

Some Thoughts on Distributed Recovery (preliminary version)

1994

Nitin H. Vaidya

This report deals with some aspects of distributed recovery. The report is divided into multiple parts, each part introducing a problem and a solution. The intent of this report is to present a medley of preliminary ideas; a more detailed treatment may be presented elsewhere. The report deals with the following problems: A single processor failure tolerance scheme based on the distributed recover...

A Distributed Consistent Global Checkpoint Algorithm with a Minimum Number of Checkpoints

1997

Yoshifumi Manabe

A distributed coordinated checkpointing algorithm is shown. A consistent global checkpoint is a set of states in which no message is recorded as received in one process and as not yet sent in another process. This algorithm obtains a consistent global checkpoint for any checkpoint initiation by any process. Under Chandy and Lamport’s assumption that one consistent global checkpoint is obtained ...
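The consistency condition stated in this abstract can be illustrated with a small check. This is a minimal sketch of the definition only, not of the algorithm in the paper; the `Message` record, the per-process event indices, and the `is_consistent` helper are all hypothetical names introduced for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    """Hypothetical message record used only for this illustration."""
    sender: str
    receiver: str
    send_seq: int  # sender's local event index at which the message was sent
    recv_seq: int  # receiver's local event index at which it was received


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if the set of checkpoints in `cut` is consistent.

    cut[p] is the index of the last local event included in process p's
    checkpoint (an assumed representation). The cut is inconsistent if
    some message is recorded as received in one process but as not yet
    sent in another.
    """
    for m in messages:
        received = m.recv_seq <= cut[m.receiver]
        sent = m.send_seq <= cut[m.sender]
        if received and not sent:
            return False
    return True
```

For example, with a single message sent by `P` at its event 3 and received by `Q` at its event 2, the cut `{"P": 2, "Q": 2}` is inconsistent, while `{"P": 3, "Q": 2}` and `{"P": 2, "Q": 1}` are both consistent.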

IMCLA: Performance Evaluation of Integrated Multilevel Checkpointing Algorithms using Checkpointing Efficiency

Journal: International Journal of Computing and Digital Systems 2013

Taking Point Decision Mechanism for Page-level Incremental Checkpointing based on Cost Analysis of Process Execution Time

Journal: J. Inf. Sci. Eng. 2007

Sangho Yi Junyoung Heo Yookun Cho Jiman Hong

Incremental checkpointing, which is intended to minimize checkpointing overhead, saves only the modified pages of a process. This means that in incremental checkpointing, the time consumed for checkpointing varies according to the number of modified pages. Thus, efficient checkpointing intervals have to be determined at run time. In this paper, we present an efficient and adapti...
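The idea of saving only modified pages can be sketched as follows. This is an illustrative sketch under stated assumptions, not the mechanism from the paper: the page size, the function names, and the use of byte-wise comparison against the previous checkpoint are all assumptions (real systems typically track dirty pages through memory protection or hardware dirty bits instead).

```python
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size in bytes


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Split a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


def incremental_checkpoint(previous: List[bytes],
                           current: List[bytes]) -> Dict[int, bytes]:
    """Return only the pages that are new or differ from the previous checkpoint."""
    delta: Dict[int, bytes] = {}
    for index, page in enumerate(current):
        if index >= len(previous) or previous[index] != page:
            delta[index] = page
    return delta


def restore(base: List[bytes], delta: Dict[int, bytes]) -> List[bytes]:
    """Apply a delta to a base checkpoint (does not handle a shrinking image)."""
    pages = list(base)
    for index, page in sorted(delta.items()):
        if index < len(pages):
            pages[index] = page
        else:
            pages.append(page)
    return pages
```

The size of the returned delta grows with the number of modified pages, which is why the checkpointing time varies between intervals.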

A Dynamic Checkpoint Interval Decision Algorithm for Live Migration-Based Drone-Recovery System

Journal: Drones 2023

Numerous services and applications have been developed to monitor anomalies or collect various sensing information in large-scale monitoring areas using drones. Nonetheless, interruptions of drone missions occasionally occur due to network errors, low battery levels, and physical defects such as damage to the rotor propeller. Checkpointing is a technique that periodically saves a system's state, allowing ...

A Fast And Efficient Non-Blocking Coordinated Checkpointing Approach For Distributed Systems

2006

Bidyut Gupta Shahram Rahimi

In this paper, we present an efficient non-blocking coordinated checkpointing algorithm for distributed systems. The distinct advantages of the proposed algorithm are the following. It produces a consistent set of checkpoints without the overhead of taking temporary checkpoints; the algorithm also ensures that only a few processes are required to take checkpoints in any of its executions; ...

An Extended Home-Based Coherence Protocol for Causally Consistent Replicated Read-Write Objects

2003

Jerzy Brzezinski Michal Szychowiak

This paper considers the reliability of software Distributed Shared Memory systems where the unit of sharing is a persistent read-write object. We present an extended coherence protocol for the causal consistency model, which integrates replication management with independent checkpointing. It uses a novel coordinated burst checkpoint operation in order to replicate consistent checkpoints of shared...

Some Thoughts on Distributed Recovery (preliminary version)

2012

Nitin H. Vaidya

This report deals with some aspects of distributed recovery. The report is divided into multiple parts, each part introducing a problem and a solution. The intent of this report is to present a medley of preliminary ideas; a more detailed treatment may be presented elsewhere. The report deals with the following problems: A single processor failure tolerance scheme based on the distributed recover...

Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF

2005

Namyoon Woo Hyungsoo Jung Dongin Shin Hyuck Han Heon Young Yeom Taesoon Park

This paper presents an implementation of several consistent recovery protocols at the abstract device level and their performance comparison. We have performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. The interesting result is that the causal message logging protocol has the most expensive recovery cost with communication intens...
