Asynchronous Complex Analytics in a Distributed Dataflow Architecture

Authors

  • Joseph Gonzalez
  • Peter Bailis
  • Michael I. Jordan
  • Michael J. Franklin
  • Joseph M. Hellerstein
  • Ali Ghodsi
  • Ion Stoica
Abstract

Scalable distributed dataflow systems have recently experienced widespread adoption, with commodity dataflow engines such as Hadoop and Spark, and even commodity SQL engines, routinely supporting increasingly sophisticated analytics tasks (e.g., support vector machines, logistic regression, collaborative filtering). However, these systems’ synchronous (often Bulk Synchronous Parallel) dataflow execution model is at odds with an increasingly important trend in the machine learning community: the use of asynchrony via shared, mutable state (i.e., data races) in convex programming tasks, which has, in a single-node context, delivered noteworthy empirical performance gains and inspired new research into asynchronous algorithms. In this work, we attempt to bridge this gap by evaluating the use of lightweight, asynchronous state transfer within a commodity dataflow engine. Specifically, we investigate the use of asynchronous sideways information passing (ASIP) that presents single-stage parallel iterators with a Volcano-like intra-operator iterator that can be used for asynchronous information passing. We port two synchronous convex programming algorithms, stochastic gradient descent and the alternating direction method of multipliers (ADMM), to use ASIPs. We evaluate an implementation of ASIPs on Apache Spark that exhibits considerable speedups as well as a rich set of performance trade-offs in the use of these asynchronous algorithms.
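
As a rough illustration of the "asynchrony via shared, mutable state" that the abstract refers to, the following is a minimal single-node sketch of lock-free stochastic gradient descent. It is not the ASIP mechanism and not the authors' Spark implementation: the function name `async_sgd`, the least-squares objective, and every parameter value are assumptions chosen for illustration. In CPython the global interpreter lock serializes most of the work, so the sketch shows the access pattern (workers reading and updating one shared model without coordination) rather than a speedup.

```python
import threading

import numpy as np


def async_sgd(X, y, n_workers=4, epochs=20, lr=0.01, seed=0):
    """Lock-free SGD for least squares (illustrative sketch, hypothetical API).

    Every worker reads and updates the same weight vector `w` with no locks
    and no barrier between workers, in contrast to a bulk-synchronous loop.
    """
    n, d = X.shape
    w = np.zeros(d)  # shared, mutable model state
    shards = np.array_split(np.arange(n), n_workers)  # one data shard per worker

    def worker(rows, worker_seed):
        rng = np.random.default_rng(worker_seed)
        for _ in range(epochs):
            for i in rng.permutation(rows):
                # Read a possibly stale w, then update it in place.
                grad = (X[i] @ w - y[i]) * X[i]
                w[:] -= lr * grad

    threads = [
        threading.Thread(target=worker, args=(rows, seed + k))
        for k, rows in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait only at the very end, never between updates
    return w


if __name__ == "__main__":
    # Synthetic, noise-free regression problem (assumed data, for demonstration).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 5))
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true
    print(async_sgd(X, y))
```

A synchronous variant would average the workers' gradients behind a barrier after each pass; the sketch removes that barrier, which is the trade-off the abstract describes evaluating in a distributed setting.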

Similar resources

Asynchronous Complex Analytics in a Distributed Dataflow Architecture (Extended Abstract)

Scalable distributed dataflow systems have recently experienced widespread adoption, with commodity dataflow engines such as Hadoop and Spark, and even commodity SQL engines routinely supporting increasingly sophisticated analytics tasks (e.g., support vector machines, logistic regression, collaborative filtering). However, these systems’ synchronous (often Bulk Synchronous Parallel) dataflow e...

Functional Reactive Stream Processing for Data-centric Publish/Subscribe Systems

The Internet of Things (IoT) paradigm has given rise to a new class of applications wherein complex data analytics must be performed in real-time on large volumes of fast-moving, heterogeneous sensor-generated data. Such data streams are often unbounded and must be processed in a distributed and parallel manner to ensure timely processing and delivery to interested subscribers. Dataflow archite...

Industry Paper: Reactive Stream Processing for Data-centric Publish/Subscribe

The Internet of Things (IoT) paradigm has given rise to a new class of applications wherein complex data analytics must be performed in real-time on large volumes of fast-moving and heterogeneous sensor-generated data. Such data streams are often unbounded and must be processed in a distributed and parallel manner to ensure timely processing and delivery to interested subscribers. Dataflow archi...

Decentralized and Cooperative Multi-Sensor Multi-Target Tracking With Asynchronous Bearing Measurements

Bearings only tracking is a challenging issue with many applications in military and commercial areas. In distributed multi-sensor multi-target bearings only tracking, sensors are far from each other, but are exchanging data using telecommunication equipment. In addition to the general benefits of distributed systems, this tracking system has another important advantage: if the sensors are suff...

Lightweight Asynchronous Snapshots for Distributed Dataflows

Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Thos...

Journal:
  • CoRR

Volume abs/1510.07092

Publication date: 2015