Cloud-agnostic architectures for machine learning based on Apache Spark

Abstract

Reference architectures for Big Data, machine learning, and stream processing include not only recommended practices and interconnected building blocks but also considerations for scalability, availability, manageability, and security. However, the automated deployment of multi-VM platforms on various clouds leveraging such reference architectures may raise several issues. The paper focuses particularly on the widespread Apache Spark Big Data platform as the baseline and on the Occopus cloud-agnostic orchestrator tool. The set of new-generation reference architectures is configurable by human-readable descriptors according to the available resources and cloud providers, and offers components such as Jupyter Notebook, RStudio, HDFS, and Kafka. These pre-configured reference architectures can be deployed automatically, even by a data scientist on demand, using a multi-cloud approach across a wide range of cloud systems such as Amazon AWS, Microsoft Azure, OpenStack, OpenNebula, and CloudSigma. The approach also enables scaling of cluster-oriented components (such as Spark) in the instantiated architectures. The presented solution was successfully used in the Hungarian Comparative Agendas Project (CAP) at the Institute for Political Science to classify newspaper articles.

Similar resources

MLlib: Machine Learning in Apache Spark

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark’s open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shi...

Benchmarking Apache Spark with Machine Learning Applications

We benchmarked Apache Spark with a popular parallel machine learning training application, Distributed Stochastic Gradient Descent for Matrix Factorization [5] and compared the Spark implementation with alternative approaches for communicating model parameters, such as scheduled pipelining using POSIX socket or MPI, and distributed shared memory (e.g. parameter server [13]). We found that Spark...

Parallel Maritime Traffic Clustering Based on Apache Spark

Maritime traffic pattern extraction is an essential part of maritime security and surveillance, and DBSCANSD is a density-based clustering algorithm that extracts the arbitrary shapes of the normal lanes from AIS data. This paper presents a parallel DBSCANSD algorithm on top of Apache Spark. The project is experimental research work, and the results shown in this paper are preliminary. The exper...

SystemML: Declarative Machine Learning on Spark

The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express cu...

Journal

Journal title: Advances in Engineering Software

Year: 2021

ISSN: 1873-5339, 0965-9978

DOI: https://doi.org/10.1016/j.advengsoft.2021.103029