The TPT-RAID Architecture for Box-Fault Tolerant Storage Systems
Authors
Abstract
TPT-RAID is a multi-box RAID in which each ECC group comprises at most one block from any given storage box, so it can tolerate the failure of an entire box. It extends the idea of an out-of-band SAN controller into the RAID: data is sent directly between hosts and targets and among targets, while the RAID controller supervises the ECC calculation performed by the targets. Because the controller is no longer a communication bottleneck, excellent scalability is achieved while the simplicity of centralized control is retained. Moreover, TPT-RAID, whose controller can be a software module within an out-of-band SAN controller, conforms to a conventional switched network architecture, whereas an in-band RAID controller would either constitute a communication bottleneck or also have to be a full-fledged router. The design is validated in an InfiniBand-based prototype using iSCSI and iSER, and the required changes to the relevant protocols are introduced.
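The two ideas in the abstract, placing at most one block of each ECC group on any box and having the targets (not the controller) compute the ECC, can be illustrated with a toy model. The sketch below is a minimal illustration under stated assumptions, not the TPT-RAID prototype: it assumes single-parity XOR ECC groups, a RAID-5-style rotated layout, and a chained parity computation in which each target XORs its own block into a running partial parity. The names `Target`, `Controller`, `layout`, `write_group`, `accumulate` and `read_slot` are hypothetical, and real transfers in the prototype used iSCSI/iSER over InfiniBand rather than in-process calls.

```python
from dataclasses import dataclass, field
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks (single-parity ECC assumed)."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


@dataclass
class Target:
    """One storage box (hypothetical model); stores blocks keyed by (group, slot)."""
    box_id: int
    blocks: dict = field(default_factory=dict)
    failed: bool = False

    def read(self, key) -> bytes:
        if self.failed:
            raise OSError(f"box {self.box_id} is unavailable")
        return self.blocks[key]

    def accumulate(self, key, incoming: bytes) -> bytes:
        """Target-side ECC step: XOR the local block into the incoming partial parity."""
        return xor(incoming, self.read(key))


class Controller:
    """Out-of-band controller (hypothetical): it chooses placement and sequences
    the steps; in this model data blocks go host->target and target->target."""

    def __init__(self, targets):
        self.targets = targets

    def layout(self, group: int, width: int):
        """Place the `width` members of an ECC group on distinct boxes (rotated)."""
        n = len(self.targets)
        if width > n:
            raise ValueError("an ECC group may hold at most one block per box")
        return [self.targets[(group + s) % n] for s in range(width)]

    def write_group(self, group: int, data_blocks):
        members = self.layout(group, len(data_blocks) + 1)
        # Host sends each data block directly to its target.
        for slot, (tgt, blk) in enumerate(zip(members, data_blocks)):
            tgt.blocks[(group, slot)] = blk
        # Parity is built along a chain of targets; the last member stores it.
        partial = bytes(len(data_blocks[0]))  # all-zero starting block
        for slot, tgt in enumerate(members[:-1]):
            partial = tgt.accumulate((group, slot), partial)
        members[-1].blocks[(group, len(data_blocks))] = partial

    def read_slot(self, group: int, width: int, slot: int) -> bytes:
        members = self.layout(group, width)
        try:
            return members[slot].read((group, slot))
        except OSError:
            # Degraded read: no other member shares the failed box, so the XOR
            # of all surviving members of the group recovers the lost block.
            survivors = [t.read((group, s)) for s, t in enumerate(members) if s != slot]
            return reduce(xor, survivors)


if __name__ == "__main__":
    boxes = [Target(i) for i in range(4)]
    ctl = Controller(boxes)
    ctl.write_group(0, [b"AAAA", b"BBBB", b"CCCC"])  # 3 data blocks + 1 parity
    boxes[1].failed = True                            # lose an entire box
    print(ctl.read_slot(0, 4, 1))                     # b'BBBB'
```

The placement check in `layout` is what makes the group box-fault tolerant in this model: if two members of one group shared a box, losing that box would erase two blocks, which a single parity block cannot recover.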
Similar resources
Stripped mirroring RAID architecture
Redundant arrays of independent disks (RAID) provide an efficient stable storage system for parallel access and fault tolerance. The most common fault-tolerant RAID architectures are RAID-1 and RAID-5. The disadvantage of RAID-1 lies in excessive redundancy, while the write performance of RAID-5 is only 1/4 of that of RAID-0. In this paper, we propose a high performance and highly reliable disk arra...
RAID-RMS: A fault tolerant stripped mirroring RAID architecture for distributed systems
Disk arrays, or RAIDs, have become the solution to increase the capacity, bandwidth and reliability of most storage systems. In spite of its high redundancy level, disk mirroring is a popular RAID paradigm, because replicating data also doubles the bandwidth available for processing read requests, improves the reliability and achieves fault tolerance. In this paper, we present a new RAID archit...
Reliability Models for Highly Fault-tolerant Storage Systems
We found that a reliability model commonly used to estimate Mean-Time-To-Data-Loss (MTTDL), while suitable for modeling RAID 0 and RAID 5, fails to accurately model systems having a fault tolerance greater than 1. Therefore, modeling the reliability of RAID 6, Triple-Replication, or k-of-n systems requires an alternate technique. In this paper, we explore some alternatives, and evaluate their e...
Fault-Tolerant Distributed Mass Storage for LHC Computing
In this paper we present the concept and first prototyping results of a modular fault-tolerant distributed mass storage architecture for large Linux PC clusters as they are deployed by the upcoming particle physics experiments. The device masquerading technique using an Enhanced Network Block Device (ENBD) enables local RAID over remote disks as the key concept of the ClusterRAID system. The bl...
Reliability Markov models are becoming unreliable (WIP submission)
Markov models have traditionally been used to understand the reliability of storage systems. They provide intuition about the sensitivity of storage system reliability to changes in disk failure rates, rebuild rates, sector failure rates, scrubbing rates, and storage capacity. Unfortunately, as we move towards multi-disk fault tolerant storage systems, i.e., storage systems that tolerate two or...