Improving Application Resilience through Probabilistic Task Replication
Abstract
Maintaining performance in a faulty distributed computing environment is a major challenge in the design of future peta- and exascale-class systems. Better defining application resilience as a function of scale is key to developing reliable software systems and programming methodologies. This paper defines the resilience of a task as the survivability of that task (i.e., how likely it is to survive until it completes). Resilience varies directly with mean time to failure (MTTF) and inversely with runtime. We develop an approach for defining a resilience index (RI) for applications running on a system with a fixed MTTF. Our approach, inspired by radioactive decay, defines an application as a collection of tasks, which we model as particles with an exponential decay rate and therefore a measurable half-life. We determine the probability of a given number of task failures for an application using a Poisson distribution over the interval of the task lifetime. Further, we have developed a distributed runtime system, ARRIA, that measures both system reliability and application performance at runtime and schedules and replicates tasks based on the probability of failure and expected runtime. We demonstrate that the resilience index can help to better define the tradeoffs for the designers of future systems and developers of parallel software. Thus, we propose a formulation of application resilience that results in a resilience index. We evaluate some initial and fundamental properties of the resilience index as they relate to application performance on high-performance computing systems composed of many components, each with varying degrees of reliability.
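The decay model described in the abstract can be illustrated with a short Python sketch. It is not the paper's resilience index or ARRIA's scheduling policy, neither of which is specified here. It assumes failures arrive at a constant rate of 1/MTTF, so a task's survival probability decays exponentially with runtime and the number of failures over the task lifetime follows a Poisson distribution. The function names and the replica-count rule (independent replicas, at least one must survive) are hypothetical and chosen only for illustration.

```python
import math


def survival_probability(runtime, mttf):
    """P(no failure during `runtime`), assuming a constant failure rate 1/MTTF."""
    return math.exp(-runtime / mttf)


def half_life(mttf):
    """Runtime at which the survival probability drops to one half."""
    return mttf * math.log(2)


def failure_count_pmf(k, runtime, mttf):
    """Poisson probability of exactly k failures over the task lifetime."""
    lam = runtime / mttf  # expected number of failures in the interval
    return lam ** k * math.exp(-lam) / math.factorial(k)


def replicas_needed(runtime, mttf, target):
    """Smallest replica count so that at least one replica survives with
    probability >= target. Assumes replicas fail independently, which is
    an illustrative simplification."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    p_fail = 1.0 - survival_probability(runtime, mttf)
    replicas = 1
    while 1.0 - p_fail ** replicas < target:
        replicas += 1
    return replicas


if __name__ == "__main__":
    runtime, mttf = 10.0, 100.0  # same time unit for both, e.g. hours
    print("survival probability:", survival_probability(runtime, mttf))
    print("half-life:", half_life(mttf))
    print("P(exactly 1 failure):", failure_count_pmf(1, runtime, mttf))
    print("replicas for 99% survival:", replicas_needed(runtime, mttf, 0.99))
```

Under these assumptions, survival improves as MTTF grows and degrades as runtime grows, which matches the direct and inverse dependence stated in the abstract.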
Similar resources
Improving Data Grids Performance by Using Modified Dynamic Hierarchical Replication Strategy
Abstract: A Data Grid connects a collection of geographically distributed computational and storage resources that enables users to share data and other resources. Data replication, a technique much discussed by Data Grid researchers in recent years, creates multiple copies of a file and places them in various locations to shorten file access times. In this paper, a dynamic data replication strate...
Operating System Support for Resilience
This paper is concerned with improving the resilience of mission-critical applications to a wide variety of failures, errors, and malicious attacks. A number of approaches have been proposed in the literature based on fault tolerance provided through replication of resources. In general, these approaches provide graceful degradation of performance to the point of failure but do not guarantee pr...
The Costs of Resilience in Overlay Multicast Protocols
One of the most important challenges of peer-to-peer multicast protocols lies in their ability to handle the high degree of transiency inherent to their environment. A number of techniques have been recently proposed aimed at improving the resilience of these application-layer approaches. However, achieving high delivery ratios without sacrificing end-to-end latencies or incurring additional co...
Application Resilience with Process Failures (1)
The notion of resiliency is concerned with constructing mission-critical applications that are able to operate through a wide variety of failures, errors, and malicious attacks. A number of approaches have been proposed in the literature based on fault tolerance achieved through replication of resources. In general, these approaches provide graceful degradation of performance to the point of fa...
A conceptual model of critical success factors in improving the resilience of the health tourism supply chain: A case study
Introduction: Today, in dynamic environments, there are many disruptions in supply chains that negatively affect the performance and productivity of organizations; therefore, identifying the critical success factors for managing disruptions is essential. This is also true in the health tourism supply chain. The primary purpose of this study was to determine the critical success factors and desig...