An Algorithm for Tolerating Crash Failures in Distributed Systems

Authors

Vincenzo De Florio

Geert Deconinck

Rudy Lauwereins

Abstract

In the framework of the ESPRIT project 28620 “TIRAN” (tailorable fault tolerance frameworks for embedded applications), a toolset of error detection, isolation, and recovery components is being designed to serve as a basic means for orchestrating application-level fault tolerance. These tools will be used either as stand-alone components or as the peripheral components of a distributed application, which we call “the backbone”. The backbone is to run in the background of the user application. Its objectives include (1) gathering and maintaining error detection information produced by TIRAN components such as watchdog timers and trap handlers, or by external detection services working at kernel or driver level, and (2) using this information at error recovery time. In particular, the TIRAN tools related to error detection and fault masking will forward their deductions to the backbone, which in turn will use this information to orchestrate error recovery, requesting recovery and reconfiguration actions from the tools related to error isolation and recovery. Clearly, a key point in this approach is guaranteeing that the backbone itself tolerates internal and external faults. In this article we describe one of the means used within the TIRAN backbone to fulfill this goal: a distributed algorithm for tolerating crash failures triggered by faults affecting at most all but one of the components of the backbone or at most all but one of the nodes of the system. We call this the algorithm of mutual

Similar resources

Supporting customized failure models for distributed software

The cost of employing software fault-tolerance techniques in distributed systems is strongly related to the type of failures to be tolerated. For example, in terms of the amount of redundancy required and execution time, tolerating a processor crash is much cheaper than tolerating arbitrary (or Byzantine) failures. The tradeoff, of course, is that making stronger assumptions about failures less...

An Adaptive Algorithm for Tolerating Value Faults and Crash Failures

The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQuA uses, when an application’s dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures...

Distributed Consensus Resilient to Both Crash Failures and Strategic Manipulations

In this paper, we study distributed consensus in synchronous systems subject to both unexpected crash failures and strategic manipulations by rational agents in the system. We adapt the concept of collusion-resistant Nash equilibrium to model protocols that are resilient to both crash failures and strategic manipulations of a group of colluding agents. For a system with n distributed agents, we...

Leader Election in Distributed Systems with Crash Failures

Leader election is an important problem in distributed computing. Garcia-Molina's Bully Algorithm is a classic solution to leader election in synchronous systems with crash failures. This paper shows that the Bully Algorithm can be easily adapted for use in asynchronous systems. First, we re-write the Bully Algorithm to use a failure detector, instead of explicit time-outs; this yields a modula...

Self-stabilization of Byzantine Protocols

Awareness of the need for robustness in distributed systems increases as distributed systems become integral parts of day-to-day systems. Self-stabilizing while tolerating ongoing Byzantine faults are wishful properties of a distributed system. Many distributed tasks (e.g. clock synchronization) possess efficient non-stabilizing solutions tolerating Byzantine faults or conversely non-Byzantine bu...

Publication date: 2000
