The Design and Implementation of a Fault-Tolerant Cluster Manager
Abstract
Cluster management middleware schedules tasks on a cluster, controls access to shared resources, provides for task submission and monitoring, and coordinates the cluster’s fault tolerance mechanisms. Thus, reliable continuous operation of the management middleware is a prerequisite to the reliable operation of the cluster. Hence, the management middleware should tolerate a wide class of faults with minimal interruptions to management operations. This paper describes design considerations and implementation details of cluster management middleware for high performance computing in space, where fault rates are significantly higher than for earth-bound systems. We describe key detection, recovery, and reconfiguration mechanisms for different components of the system. The system is based on centralized decision making. Unlike other systems, the decision making capability is protected by active replication and the ability to restore the decision maker to full operational and fault tolerance capabilities following node failure. The management middleware is used to provide the application tasks with an out-of-band signaling capability that can be a key building block for application-level fault tolerance mechanisms. The middleware described has been implemented as part of the UCLA Fault-Tolerant Cluster Testbed (FTCT) project. Based on measurements of this implementation, we present a preliminary evaluation of the overheads incurred by the management middleware.
Similar resources
Fault-tolerant Cluster Management for Reliable High-performance Computing
Clusters of COTS workstations/PCs are commonly used to implement cost-effective high-performance systems. A central coordinator/manager is often the simplest way to implement many of the operations required for managing these distributed systems. These operations include scheduling of parallel tasks, coordination of access to limited resources, as well as high-level coordination of fault tolera...
Fault-tolerant adder design in quantum-dot cellular automata
Quantum-dot cellular automata (QCA) are an emerging technology and a possible alternative for faster speed, smaller size, and low power consumption than semiconductor transistor based technologies. Previously, adder designs based on conventional designs were examined for implementation with QCA technology. This paper utilizes the QCA characteristics to design a fault-tolerant adder that is more...
Novel Defect Terminology Beside Evaluation And Design Fault Tolerant Logic Gates In Quantum-Dot Cellular Automata
Quantum-dot Cellular Automata (QCA) is one of the important nano-level technologies for implementation of both combinational and sequential systems. QCA has the potential to achieve low power dissipation and to operate at high speed, at THz frequencies. However, the high probability of fabrication defects in QCA is a fundamental challenge to using this emerging technology. Because of these vari...
Design and Validation of Portable Communication Infrastructure for Fault-Tolerant Cluster Middleware
We describe the communication infrastructure (CI) for our fault-tolerant cluster middleware, which is optimized for two classes of communication: for the applications and for the cluster management middleware. This CI was designed for portability and for efficient operation on top of modern user-level message passing mechanisms. We present a functional fault model for the CI and show how platfo...