mapreduce

An Empirical Evaluation of MapReduce under Interruptions

2011

Hui Jin Xi Yang Xian-He Sun Ioan Raicu

The presence of interruptions is an unwanted but inevitable fact that all large-scale distributed computing systems have to face. Interruptions are more prevalent for MapReduce applications, as MapReduce often runs on top of commodity-hardware-based clusters, which are more vulnerable than traditional HEC systems. The problem is further exacerbated when running MapReduce application...

Sorting, Searching, and Simulation in the MapReduce Framework

2011

Michael T. Goodrich Nodari Sitchinava Qin Zhang

In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP paralle...

Security and Privacy Aspects in MapReduce on Clouds: A Survey

Journal: Computer Science Review 2016

Philip Derbeko Shlomi Dolev Ehud Gudes Shantanu Sharma

MapReduce is a programming system for distributed processing of large-scale data in an efficient and fault-tolerant manner on a private, public, or hybrid cloud. MapReduce is extensively used daily around the world as an efficient distributed computation tool for a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern ma...

Thesis Report: Resource Utilization Provisioning in MapReduce

Journal: CoRR 2012

Hamidreza Barati Nasrin Jaberi

In this thesis report, we survey state-of-the-art methods for modelling the resource utilization of MapReduce applications with regard to their configuration parameters. After implementing one of the algorithms from the literature, we tried to determine whether the CPU usage model of one MapReduce application can be used to predict the CPU usage of another MapReduce application.

Comparative Study Parallel Join Algorithms for MapReduce environment

2012

A. Pigul

The following techniques are used to analyze massive amounts of data: the MapReduce paradigm, parallel DBMSs, column-wise stores, and various combinations of these approaches. We focus on the MapReduce environment. Unfortunately, join algorithms are not directly supported in MapReduce. The aim of this work is to generalize and compare existing equi-join algorithms with some optimization ...

Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud

2014

Keke Chen Shumin Guo James Powers Fengguang Tian

Running MapReduce programs in the cloud introduces an important problem: how to optimize resource provisioning to minimize the financial charge or job finish time for a specific job? An important step towards this ultimate goal is modeling the cost of a MapReduce program. In this chapter, we study the whole process of MapReduce processing and build...

Parallel Sorted Neighborhood Blocking with MapReduce

2011

Lars Kolb Andreas Thor Erhard Rahm

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce j...

Hadoop Mapreduce Framework in Big Data Analytics

2014

Vidyullatha Pellakuri Rajeswara Rao

Hadoop is a substantial-scale, open source programming system committed to adaptable, disseminated, information concentrated processing. Hadoop [1] MapReduce is a programming structure for effectively composing requisitions which prepare boundless measures of information (multi-terabyte information sets) in parallel on extensive bunches (many hubs) of merchandise fittings in a dependable, sho...

TrustedMR: A Trusted MapReduce System Based on Tamper Resistance Hardware

2015

Quoc-Cuong To Benjamin Nguyen Philippe Pucheral

With scalability, fault tolerance, ease of programming, and flexibility, MapReduce has attracted much attention for large-scale data processing. However, despite its merits, MapReduce does not focus on the problem of data privacy, especially when processing sensitive data, such as personal data, on untrusted infrastructure. In this paper, we investigate a scenario based on the Trusted Cells para...

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Journal: PVLDB 2015

Juwei Shi Yunjie Qiu Umar Farooq Minhas Limei Jiao Chen Wang Berthold Reinwald Fatma Özcan

MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a s...
