Improving Message Logging Protocols Scalability through Distributed Event Logging

Authors

  • Thomas Ropars
  • Christine Morin
Abstract

Message logging is an attractive way to provide fault tolerance for message-passing applications because it scales better than coordinated checkpointing. Sender-based message logging is a well-known optimization that keeps message payloads in the sender's memory, so only the events corresponding to message receptions have to be logged reliably, using an event logger. In existing work on message logging, the event logger has always been treated as a centralized process, which limits the scalability of message logging protocols. In this paper, we propose a distributed event logger. This new event logger takes advantage of multi-core processors to run in parallel with application processes, and it uses the nodes' volatile memory to save events reliably. We propose a simple gossip-based dissemination protocol to make application processes aware of new stable events. We evaluated our distributed event logger in the Open MPI library with an optimistic and a pessimistic message logging protocol. Experiments show that distributed event logging improves the scalability of message logging protocols.
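The abstract does not describe how the gossip-based dissemination works, so the following is only an illustrative sketch of the general idea, not the authors' protocol. It assumes that each process keeps a vector holding, for every process, the highest event sequence number it knows to be stable, and that in each round every process pushes its vector to one randomly chosen peer, which merges it by element-wise maximum. The names `merge_stable`, `gossip_round`, and `simulate` are hypothetical.

```python
import random


def merge_stable(local, remote):
    """Merge two stability vectors by element-wise maximum.

    Entry j is the highest event sequence number of process j that is
    known to be stable (an assumed representation, not from the paper).
    """
    return [max(a, b) for a, b in zip(local, remote)]


def gossip_round(views, rng):
    """One push-gossip round: each process sends its view to one random peer."""
    n = len(views)
    for sender in range(n):
        peer = rng.choice([p for p in range(n) if p != sender])
        views[peer] = merge_stable(views[peer], views[sender])


def simulate(n, seed=0, max_rounds=100):
    """Run gossip rounds until every process knows every stable event.

    Returns (rounds, views); rounds is None if max_rounds was not enough.
    """
    rng = random.Random(seed)
    # Initially, process i only knows that its own first (i + 1) events are stable.
    views = [[(i + 1) if j == i else 0 for j in range(n)] for i in range(n)]
    target = [i + 1 for i in range(n)]
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == target for view in views):
            return rounds, views
    return None, views


if __name__ == "__main__":
    rounds, views = simulate(8)
    print("rounds needed:", rounds)
    print("view of process 0:", views[0])
```

Merging by maximum makes the exchange idempotent and order-independent, so duplicated or reordered gossip messages cannot make a process forget that an event is stable.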

Similar resources

On Tolerating Failures of Mobile Hosts and Mobile Support Stations

In this paper, we present two fault-tolerant protocols for mobile computing systems; a causal message logging protocol and a receiver-based pessimistic message logging protocol for tolerating failures of mobile hosts (MHs) and mobile support stations (MSSs) respectively. The systems raise several constraints such as limited life of battery power, mobility and disconnection of hosts and lack of ...

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hi...

Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery

In the rollback recovery of large-scale long-running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during fault-free execution. In contrast, coordinated checkpointing protocols avoid message logging, but they are poor in scalability with a sharply inc...

Correlated set coordination in fault tolerant message logging protocols for many-core clusters

With our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the...

Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Based on our current expectation for the exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving th...

Publication date: 2010