Search results for: coordinated checkpointing

Number of results: 48092

Journal: Scalable Computing: Practice and Experience 2016
Samy Sadi Belabbes Yagoubi

Checkpoint/Restart, or checkpointing, is a fault tolerance technique that consists of taking frequent snapshots of an application so that, in the event of a failure, the application's state can be restored and its execution continued without necessarily restarting it. The advent of Cloud Computing brought new challenges with regard to this technique as Fault Tolerance needs to be ...

2006
Taichi Jinno Tokimasa Kamiya Motoyasu Nagata

In grid computing, system recovery is carried out using checkpoints recorded at each node. The resource manager must recover the system while keeping global consistency to prevent the domino effect. Currently, coordinated checkpointing, in which all processes can be synchronized, is widely used. Considering the overhead due to synchronization, we will present a coordinated checkpoint protocol using vector ti...

1996
Nuno Neves W. Kent Fuchs

This paper describes and evaluates a coordinated checkpoint protocol that uses time to eliminate several performance overheads that are present in traditional protocols. The time-based protocol does not have to exchange coordination messages, does not need to add information to the processes' messages, and only accesses stable storage when checkpoints are saved. This protocol uses a simple initia...

2014
Manoj Kumar Niranjan Mahesh Motwani D. Manivannan R. H. B. Netzer Guohong Cao Mukesh Singhal E. N. Elnozahy D. B. Johnson Lorenzo Alvisi Yi-Min Wang David B. Johnson M. M. Naidu Sarmistha Neogy Anupam Sinha Pradip K Das J. Makhijani M. K. Niranjan M. Motwani A. K. Sachan A. Rajput

Introduction to Distributed System Design, Google Code University, http://code.google.com/edu/parallel/dsd-tutorial.html#Basics; D. Manivannan, R. H. B. Netzer & M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation", IEEE Trans. on Parallel & Distributed Systems, Vol. 8, No. 6, pp. 623-627 (June 1997); J. Tsai & S. Kuo, "Theoretical Analysis for Commun...

2010
Praveen Kumar Ajay Khunteta

While dealing with mobile distributed systems, we come across issues such as mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power, and the high failure rate of mobile nodes. These issues make traditional checkpointing techniques designed for distributed systems unsuitable for mobile environments. In this paper, we design a ...

2011
Kurt B. Ferreira Rolf Riesen Ron Brightwell Patrick G. Bridges Dorian C. Arnold

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint str...

1993
Nitin H. Vaidya

Traditionally, distributed recovery schemes have been designed for systems consisting of multiple recovery units. Each recovery unit (RU) resides on a single processor and it can fail and recover as a whole. This report introduces the "distributed recovery unit (DRU)" abstraction as an approach for design of "hybrid" and "adaptive" recovery schemes for distributed systems. The distributed syste...

2001
Jiannong Cao G. H. Chan Tharam S. Dillon Weijia Jia

We consider the problem of designing rollback error recovery algorithms for dynamic, wide area distributed systems like the Internet. The characteristics and the scale of such a system complicate the design and performance of the algorithms. Traditional message passing based algorithms incur large overhead, in both the network traffic and message passing delay, in such a wide-area environment. ...

Journal: IJHPCA 2005
Sriram Sankaran Jeffrey M. Squyres Brian W. Barrett Vishal Sahay Andrew Lumsdaine Jason Duell Paul Hargrove Eric Roman

As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kern...
