mapreduce

CS 6332: Fall 2008 Systems for Large Data Review

2010

Guozhang Wang

MapReduce [10] gives us an appropriate model for distributed parallel computing. Several of its features have proved useful: 1) centralized job distribution; 2) a fault-tolerance mechanism for both masters and workers. Although there is controversy about MapReduce's capability to replace standard RDBMSs [12, 13], it is reasonable that existing proposals to use MapReduce in relational data ...


Understanding application-level interoperability: Scaling-out MapReduce over high-performance grids and clouds

Journal: Future Generation Comp. Syst. 2011

Saurabh Sehgal Miklós Erdélyi André Merzky Shantenu Jha

Application-level interoperability is defined as the ability of an application to utilize multiple distributed heterogeneous resources. Such interoperability is becoming increasingly important with increasing volumes of data, multiple sources of data as well as resource types. The primary aim of this paper is to understand different ways and levels in which application-level interoperability ca...


Formal Derivation of Distributed MapReduce

2014

Inna Pereverzeva Michael J. Butler Asieh Salehi Fathabadi Linas Laibinis Elena Troubitsyna

MapReduce is a powerful distributed data processing model that is currently adopted in a wide range of domains to efficiently handle large volumes of data, i.e., cope with the big data surge. In this paper, we propose an approach to formal derivation of the MapReduce framework. Our approach relies on stepwise refinement in Event-B and, in particular, the event refinement structure approach – a ...


Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads

Journal: PVLDB 2012

Yanpei Chen Sara Alspaugh Randy H. Katz

Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query pro...


Improving the Load Balance of MapReduce Operations based on the Key Distribution of Pairs

Journal: CoRR 2014

Liya Fan Bo Gao Xi Sun Fa Zhang Zhiyong Liu

Load balance is important for MapReduce to reduce job duration, increase parallel efficiency, etc. Previous work focuses on coarse-grained scheduling. This study concerns fine-grained scheduling on MapReduce operations. Each operation represents one invocation of the Map or Reduce function. Scheduling MapReduce operations is difficult due to highly skewed operation loads, no support to collect w...
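To make the skew problem concrete, here is a minimal Python sketch, not the method of the paper above. It assumes the load of a reduce key is simply the number of intermediate pairs carrying that key, and it swaps in a generic greedy "largest load first" heuristic to assign keys to reducers. The names `key_loads`, `greedy_assign`, and `reducer_totals` are hypothetical and exist only for this illustration.

```python
import heapq
from collections import Counter


def key_loads(pairs):
    """Count how many intermediate (key, value) pairs each key carries."""
    return Counter(key for key, _ in pairs)


def greedy_assign(loads, num_reducers):
    """Assign keys to reducers, heaviest key first, always picking the
    reducer with the smallest total load so far (a generic heuristic,
    assumed here for illustration)."""
    heap = [(0, r) for r in range(num_reducers)]  # (total load, reducer id)
    assignment = {}
    for key, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def reducer_totals(loads, assignment, num_reducers):
    """Sum the per-key loads that land on each reducer."""
    totals = [0] * num_reducers
    for key, reducer in assignment.items():
        totals[reducer] += loads[key]
    return totals


# A skewed key distribution: key "a" carries half of all pairs.
pairs = [("a", 1)] * 6 + [("b", 1)] * 3 + [("c", 1)] * 2 + [("d", 1)]
loads = key_loads(pairs)
assignment = greedy_assign(loads, num_reducers=2)
totals = reducer_totals(loads, assignment, num_reducers=2)
```

In this toy input the heavy key "a" gets a reducer to itself while the three lighter keys share the other, which is the kind of balance that a partitioner unaware of the key distribution is not guaranteed to reach.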


Data Cloud for Distributed Data Mining via Pipelined MapReduce

2011

Zhiang Wu Jie Cao Changjian Fang

Distributed data mining (DDM), which often utilizes autonomous agents, is a process to extract globally interesting associations, classifiers, clusters, and other patterns from distributed data. As datasets double in size every year, moving the data repeatedly to distant CPUs brings about high communication costs. In this paper, a data cloud is utilized to implement DDM in order to move the data rat...


Efficient Skyline Computation in MapReduce

2014

Kasper Mullesgaard Jens Laurits Pedersen Hua Lu Yongluan Zhou

Skyline queries are useful for finding interesting tuples from a large data set according to multiple criteria. The sizes of data sets are constantly increasing, and the architectures of back-ends are switching from single-node environments to non-conventional paradigms like MapReduce. Despite the usefulness of skyline queries, existing works on skyline computation in MapReduce do not take full a...
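As a reminder of what a skyline query returns, the following Python sketch computes a skyline in two phases that mirror a map step and a reduce step. It is a naive baseline written for illustration, not the algorithm proposed in the paper above. It assumes that smaller values are better in every dimension and that the input is split round-robin; `dominates`, `local_skyline`, and `two_phase_skyline` are hypothetical names.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def local_skyline(points):
    """Keep the points that no other point in the same list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def two_phase_skyline(points, num_partitions=2):
    """Map-like phase: skyline of each partition.
    Reduce-like phase: skyline of the merged local results."""
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    candidates = [p for part in partitions for p in local_skyline(part)]
    return local_skyline(candidates)


# Hypothetical hotels as (price, distance to beach); lower is better for both.
hotels = [(50, 8), (60, 5), (80, 2), (90, 3), (70, 9)]
skyline = two_phase_skyline(hotels)
```

The two-phase result equals the global skyline because a point dominated inside its own partition is also dominated globally, so dropping it early loses nothing. The final merge runs on a single node, which is the bottleneck this baseline leaves unaddressed.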


Adaptive Optimal Control of MapReduce Performance, Availability and Costs

2017

Sophie Cerf Mihaly Berekmeri Bogdan Robu Nicolas Marchand Sara Bouchenak

MapReduce is a popular programming model for distributed data processing and Big Data applications running on clouds. Extensive research has been conducted either to improve the dependability or to increase the performance of MapReduce, ranging from adaptive and on-demand fault-tolerance solutions and adaptive task scheduling techniques to optimized job execution mechanisms. This paper investigates an...


Towards Optimizing Hadoop Provisioning in the Cloud

2009

Karthik Kambatla Abhinav Pathak Himabindu Pucha

Data analytics is becoming increasingly prominent in a variety of application areas ranging from extracting business intelligence to processing data from scientific studies. The MapReduce programming paradigm lends itself well to these data-intensive analytics jobs, given its ability to scale out and leverage several machines to process data in parallel. In this work we argue that such MapReduce-base...


A Research of MapReduce with GPU Acceleration

2012

Miao Xin Hao Li Joan Lu

MapReduce is an efficient distributed computing model for large data sets. The data processing is fully distributed across a huge number of nodes, and a MapReduce cluster is highly scalable. However, single-node performance is gradually becoming a bottleneck in compute-intensive jobs, which makes it difficult to extend the MapReduce model to wider application fields such as large-scale image processing ...
