On the usability of Hadoop MapReduce, Apache Spark & Apache Flink for data science
Abstract
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance through better in-memory processing, but in particular provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms, Apache Hadoop MapReduce, Apache Spark, and Apache Flink, from a usability perspective. We report on the design, execution and results of a usability study with a cohort of master's students, who learned and worked with all three platforms in order to solve different use cases set in a data science context. Our findings show that participants preferred Spark and Flink over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make big data platforms more or less effective for users in data science.
Related papers
A comparison on scalability for batch big data processing on Apache Spark and Apache Flink
The large amounts of data have created a need for new fram...
DduP - towards a deduplication framework utilising Apache Spark
This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples. DduP tries to bridge the gap between big data, high performance and duplicate detection. At the moment a first prototype exists, but the overall project is still a work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...
Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark
For many years Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of the issues, we have previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines usi...
Good parallel software development practices. Apache Spark case
Recently, Spark has gained huge popularity as a data processing engine because of its speed. The developers of Spark claim that it can outperform Hadoop MapReduce by a factor of 100 in memory and by a factor of 10 on disk [1]. This paper outlines which innovations improved speed and how. To investigate the improvements, I analysed the available technical documentation, since b...
Flame-MR: An event-driven architecture for MapReduce applications
Nowadays, many organizations analyze their data with the MapReduce paradigm, most of them using the popular Apache Hadoop framework. As the data size managed by MapReduce applications is steadily increasing, the need to improve Hadoop performance also grows. Existing modifications of Hadoop (e.g., Mellanox Unstructured Data Accelerator) attempt to improve performance by changing some of ...