Rejuvenation and Failure Detection in Partitionable Systems
Abstract
Certain gateways (e.g., some cable or DSL modems) are known to have low reliability and low availability. Most failures of these devices can, however, be “fixed” by rejuvenating the device after a failure has been detected. Such a detection-based rejuvenation strategy increases the availability of these gateways. In the scenario considered here, rejuvenation is non-trivial because a failed gateway is partitioned away from the network. In particular, the network operators who want to rejuvenate these gateways are in a different network partition and therefore cannot initiate a remote rejuvenation. In this paper we propose a failure-detection-based rejuvenation service and a remote detection service. The rejuvenation service detects and fixes “soft” failures automatically (in one partition), and the detection service detects (in another partition) every rejuvenation exactly once, within a bounded amount of time, even when the gateway is rejuvenated several times in succession. The detection service also allows the detection of “hard” failures and the filtering of notifications of soft failures.
Similar resources
Optimal Rejuvenation Scheduling of Distributed Computation Based on Dynamic Programming
Software rejuvenation, a complementary approach to handling transient software failures, has recently become popular as a proactive fault management technique in operational software systems. In this study, we develop optimal scheduling algorithms to trigger software rejuvenation in distributed computation circumstances. In particular, we focus on two different computation circumstanc...
Implementing Diamond P with Bounded Messages on a Network of ADD Channels
We present an implementation of the eventually perfect failure detector (♦P) from the original hierarchy of the Chandra-Toueg [4] oracles on an arbitrary partitionable network composed of unreliable channels that can lose and reorder messages. Prior implementations of ♦P have assumed different partially synchronous models ranging from bounded point-to-point message delay and reliable communica...
How the Time-Before-Failure Reacts to Periodic Rejuvenation
Rebooting is one of the commonly used approaches to recover from undesired crash or performance degradation in software systems. Recently, however, planned and periodic restart or rejuvenation has been proposed as a reliability management tool for avoiding unwanted failure of long-running systems. This paper presents an interesting observation that periodic rejuvenation alters the lifetime dist...
Self-healing in payment switches with a focus on failure detection using State Machine-based approaches
Composition, change and complexity have attracted everyone’s attention towards Self-Adaptive systems. These systems, inspired by the human body, are capable of adapting to changes in the inner and outer environment. The main objective of this study is to achieve a more convenient availability for e-banking services in the payment switch, using self-healing systems and focusing on the failur...
Quiescent Reliable Communication and Quiescent Consensus in Partitionable Networks
We consider partitionable networks with process crashes and lossy links, and focus on the problems of reliable communication and consensus for such networks. For both problems we seek algorithms that are quiescent, i.e., algorithms that eventually stop sending messages. We first tackle the problem of reliable communication for partitionable networks by extending the results of [ACT97a]. In part...