Search results for: apache spark

Number of results: 18089

Journal: Scalable Computing: Practice and Experience 2016
Lukas Forer, Enis Afgan, Hansi Weißensteiner, Davor Davidovic, Günther Specht, Florian Kronenberg, Sebastian Schönherr

For many years Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of these issues, we have previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines usi...
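To make the MapReduce paradigm the abstract refers to concrete, here is a minimal sketch of its map / shuffle / reduce phases in plain Python, using the canonical word-count example. This is an illustration of the programming model only, not Cloudflow's or Hadoop's implementation.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical MapReduce example.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

lines = ["spark hadoop spark", "hadoop mapreduce"]
result = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(result)  # {'spark': 2, 'hadoop': 2, 'mapreduce': 1}
```

The boilerplate around the two tiny user functions (`mapper`, `reducer`) hints at why higher-level pipeline frameworks such as Cloudflow exist.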

Journal: JCS 2016
Adai Shomanov, Madina Mansurova

Corresponding Author: Aday Shomanov, Department of Computer Science, al-Farabi Kazakh National University, Almaty, Kazakhstan. Email: [email protected] Abstract: Parallel computations are an essential tool in solving large-scale, computationally demanding problems. Due to the large diversity and heterogeneity of the currently available parallel processing techniques and paradigms, it is usually diff...

2015
Zaid Al-Ars, Hamid Mushtaq

This paper analyzes the scalability potential of embarrassingly parallel genomics applications using the Apache Spark big data framework and compares their performance with native implementations as well as with Apache Hadoop scalability. The paper uses the BWA DNA mapping algorithm as an example due to its good scalability characteristics and due to the large data files it uses as input. Resul...

2015
Jia Yu, Jinxuan Wu, Mohamed Sarwat

This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading/storing data to disk as well as regular RDD operations. The Spatial RDD Layer consists of three nove...
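The layering the abstract describes rests on a simple idea: partition points spatially so a query only touches the partitions it overlaps. A minimal sketch of that idea in plain Python, using a hypothetical uniform grid partitioner (GeoSpark's actual partitioners and query API may differ):

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")
Rect = namedtuple("Rect", "xmin ymin xmax ymax")

def grid_partition(points, cell_size):
    """Assign each point to a grid cell, mimicking a spatially-aware partitioner."""
    partitions = {}
    for p in points:
        cell = (int(p.x // cell_size), int(p.y // cell_size))
        partitions.setdefault(cell, []).append(p)
    return partitions

def range_query(partitions, rect, cell_size):
    """Answer a range query by scanning only the grid cells the rectangle overlaps."""
    cx0, cx1 = int(rect.xmin // cell_size), int(rect.xmax // cell_size)
    cy0, cy1 = int(rect.ymin // cell_size), int(rect.ymax // cell_size)
    hits = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for p in partitions.get((cx, cy), []):
                if rect.xmin <= p.x <= rect.xmax and rect.ymin <= p.y <= rect.ymax:
                    hits.append(p)
    return hits

points = [Point(1, 1), Point(5, 5), Point(9, 9), Point(2, 8)]
parts = grid_partition(points, cell_size=4)
print(range_query(parts, Rect(0, 0, 4, 4), cell_size=4))  # [Point(x=1, y=1)]
```

In a cluster setting, each grid cell would correspond to an RDD partition, so the query prunes whole partitions before any per-point work happens.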

Journal: CoRR 2017
Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, Thorsten Strufe

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation effi...
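The sampling trade-off the abstract describes can be shown in a few lines: estimate an aggregate from a random sample instead of the full dataset, accepting some error in exchange for touching far fewer records. A minimal sketch in plain Python (not the authors' system; the estimator and sample size here are illustrative):

```python
import random

def approximate_mean(data, sample_size, seed=42):
    """Estimate the mean of `data` from a uniform random sample of `sample_size` items."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    return sum(sample) / sample_size

data = list(range(1, 10001))          # exact mean is 5000.5
exact = sum(data) / len(data)
# Only 5% of the records are read, yet the estimate lands close to the truth.
approx = approximate_mean(data, sample_size=500)
print(abs(approx - exact) / exact)    # small relative error
```

Increasing `sample_size` tightens the estimate at proportionally higher cost, which is exactly the systematic accuracy/efficiency trade-off the paper refers to.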

Journal: CoRR 2016
Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Big data processing is a hot topic in today's computer science world. There is significant demand for analysing big data to satisfy the requirements of many industries. The emergence of the Kappa architecture created a strong requirement for a highly capable and efficient data processing engine. Therefore, data processing engines such as Apache Flink and Apache Spark emerged in the open source world ...

Journal: DEStech Transactions on Engineering and Technology Research 2018

Thesis: Alzahra University 1393

Given the rapid growth of data in recent years, techniques are needed to manage this data. Various companies have therefore proposed frameworks for this purpose, MapReduce and Apache Spark among them. These frameworks handle the complexities of parallel programming, such as data distribution and scheduling. At the same time, querying data at this scale is also very important. Therefore, in this research, a method ...

2014
Bo Liu

Maritime traffic pattern extraction is an essential part of maritime security and surveillance, and DBSCANSD is a density-based clustering algorithm that extracts the arbitrary shapes of normal lanes from AIS data. This paper presents a parallel DBSCANSD algorithm on top of Apache Spark. The project is experimental research work and the results shown in this paper are preliminary. The exper...
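For readers unfamiliar with the density-based clustering family the abstract builds on, here is a naive single-machine DBSCAN sketch in plain Python. It illustrates the core-point/neighborhood expansion idea only; DBSCANSD's speed/direction extensions and the paper's Spark parallelization are not reproduced here.

```python
import math

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # noise (may later be claimed by a cluster)
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point: expand the cluster
        cluster += 1
    return labels

# Two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The all-pairs neighborhood search is quadratic, which is precisely why running such an algorithm over large AIS datasets motivates a partitioned Spark implementation.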

Journal: International Journal of Computer Applications 2017

[Chart: number of search results per year]
