Reliable Communication for Datacenters

Author

  • Mahesh Balakrishnan
Abstract

Datacenter platforms have dominated the systems landscape over the last decade, offering applications the promise of scalability, availability and responsiveness at very low costs. Delivering on this promise is a significant research challenge: datacenters consist of thousands of inexpensive fault-prone components, running commodity operating systems and protocols ill-fitted for high-performance applications. Further, datacenter applications have unconventional scaling requirements and bursty workloads that frequently push systems into delays and down-time. This thesis seeks to provide systems with low-latency primitives for reliable communication that are fundamentally scalable and robust to faults and attacks. Our focus is on the design and implementation of two protocols: Maelstrom and Ricochet. Maelstrom is a transparent network appliance for reliable and rapid communication over high-speed optical networks between datacenters. Ricochet is a low-latency messaging layer for clustered applications running within datacenters. An important aspect of these two protocols is the use of proactive fault-handling techniques such as Forward Error Correction (FEC) and gossip to achieve low delays and stable performance. Reactive protocols do too much too late, imposing extra delays and overheads that often send systems into spirals of degrading performance. In contrast, proactive protocols recover from faults almost instantly and impose stable, predictable overheads that prevent transient overloads and failures from translating into application unavailability. Both Maelstrom and Ricochet use fast and simple XOR operations in novel ways that allow datacenter applications to scale in new and vital dimensions. In particular, they create XORs at strategic points in the network (respectively, within an appliance and at multicast receivers) and from different data channels to obtain excellent recovery and latency properties.
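
The XOR-based recovery idea mentioned above can be illustrated with a minimal sketch. This is plain single-parity FEC, not the actual encoding used by Maelstrom or Ricochet, whose details the abstract does not give. The function names (`xor_bytes`, `make_repair`, `recover`), the fixed packet length, and the group size are all assumptions made for illustration.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("packets must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets: list[bytes]) -> bytes:
    """Build one repair packet as the XOR of every data packet in a group."""
    return reduce(xor_bytes, packets)


def recover(received: dict[int, bytes], repair: bytes, group_size: int) -> dict[int, bytes]:
    """Restore at most one missing packet of a group using the repair packet."""
    missing = [i for i in range(group_size) if i not in received]
    if len(missing) > 1:
        raise ValueError("a single XOR can repair only one loss per group")
    result = dict(received)
    if missing:
        # XORing the repair packet with all surviving packets cancels them
        # out and leaves the lost packet.
        result[missing[0]] = reduce(xor_bytes, received.values(), repair)
    return result


# Hypothetical group of four equal-length packets; packet 2 is lost in transit.
group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
repair = make_repair(group)
arrived = {i: p for i, p in enumerate(group) if i != 2}
restored = recover(arrived, repair, len(group))
```

Because the repair packet is sent ahead of any loss report, the receiver can rebuild a lost packet locally without waiting for a retransmission round trip, which is the proactive property the abstract contrasts with reactive protocols. The cost is a fixed overhead of one extra packet per group.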
Together, these protocols enable the development of highly available applications that coordinate within and across datacenters while maintaining scalable and robust responsiveness.

Mahesh Balakrishnan grew up in New Delhi, where he spent the 90s trading computer games on floppy disks, downloading them on dial-up modems from choked BBSes and occasionally hacking them to get infinite health. A career in Computer Science seemed inevitable and he joined Georgia Tech in 2000. Three years later, he graduated with a B.S. degree and the firm resolve to find a career path that did not involve waking up before noon. He entered Cornell's Ph.D. program in the fall of 2003 and spent the next five years exercising his right to work whenever he felt like it. In the summer of 2006 Mahesh interned at Microsoft Research Silicon Valley, learning …

Related resources

Communication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology

With the pervasiveness of cloud computing, a colossal number of applications from gigantic organizations increasingly tend to rely on cloud services. These demands cause a great number of applications, in the form of a couple of virtual machine (VM) requests, to be executed on datacenter servers. Some applications are too big to be processed on a single VM. Also, there exists severa...

Topology-aware Gossip Dissemination for Large-scale Datacenters

Gossip-based protocols are very robust and are able to distribute the load uniformly among all processes. Furthermore, gossip-protocols circumvent the oscillatory phenomena that are known to occur with other forms of reliable multicast. As a result, they are excellent candidates to support the dissemination of information in large-scale datacenters. However, in this context, topology oblivious ...

Sprinkler - Reliable Broadcast for Geographically Dispersed Datacenters

This paper describes and evaluates Sprinkler, a reliable high-throughput broadcast facility for geographically dispersed datacenters. For scaling cloud services, datacenters use caching throughout their infrastructure. Sprinkler can be used to broadcast update events that invalidate cache entries. The number of recipients can scale to many thousands in such scenarios. The Sprinkler infrastructur...

Message Futures: Fast Commitment of Transactions in Multi-datacenter Environments

Geo-replication of large Internet services is increasingly deployed for better data locality and fault tolerance. Maintaining consistency across datacenters is expensive and requires wide-area communication. This renders current solutions to either settle for weaker forms of consistency or suffer from large delays. In this work we present Message Futures, a strongly consistent concurrency contr...

Supporting Large-scale Continuous Stream Datacenters via Pub/Sub Middleware and Adaptive Transport Protocols

Large-scale datacenters that handle continuous data streams require scalable and flexible communication infrastructure. The scalability of publish/subscribe (pub/sub) middleware coupled with fine-grained quality-of-service (QoS) support and adaptive transport protocols constitutes a promising area of research to address the challenges of these types of large-scale datacenters. This paper descri...

Publication date: 2008