Benchmarking and Performance Studies of MapReduce / Hadoop Framework on Blue Waters Supercomputer

Authors

  • Manisha Gajbe
  • Kalyana Chadalavada
  • Gregory Bauer
  • William Kramer
Abstract

MapReduce is an emerging and widely used programming model for large-scale data-parallel applications that process large amounts of raw data. There are several implementations of the MapReduce framework, among which Apache Hadoop is the most widely used open-source implementation. These frameworks are rarely deployed on supercomputers as massive as Blue Waters. We want to evaluate how such a massive HPC resource can help solve large-scale data analytics and data-mining problems using the MapReduce / Hadoop framework. In this paper we present our studies and a detailed performance analysis of the MapReduce / Hadoop framework on the Blue Waters supercomputer. We used a standard, popular MapReduce benchmark suite that represents a wide range of MapReduce applications with various computation and data densities, and we plan to use the Intel HiBench Hadoop Benchmark Suite in the future. We identify a few factors that significantly affect the performance of MapReduce / Hadoop and shed light on a few alternatives that can improve the overall performance of MapReduce techniques on the system. The results we have obtained strengthen our belief in the possibility of using massive specialized supercomputers to tackle big-data problems. We demonstrate the initial performance of the MapReduce / Hadoop framework with encouraging results, and we are confident that massive traditional high-performance computing resources can be useful in tackling big-data research challenges and in solving large-scale data analytics and data-mining problems.
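For readers unfamiliar with the programming model, the canonical word-count example from the Hadoop documentation illustrates the map and reduce phases that benchmarks of this kind exercise. This is a minimal sketch for orientation only; it is not one of the benchmark kernels used in the study.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }
        // Reduce phase: sum the counts gathered for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compiled into a jar, it runs as "hadoop jar wc.jar WordCount <input> <output>", with both paths in HDFS.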


Similar Articles

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The rack-aware data placement strategy used by the current Hadoop and the existing Hadoop Distributed File System assumes a homogeneous cluster, in which every node has the same computing capacity and is assigned the same workload. Default Hadoop d...
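The heterogeneity idea can be pictured with a toy policy that assigns blocks in proportion to measured node capacity rather than uniformly. This sketch is our own illustration of the general approach, not the algorithm proposed in the paper; node names and capacity scores would be supplied by the caller and are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy illustration: distribute data blocks in proportion to per-node capacity.
    public class ProportionalPlacement {
        public static Map<String, Long> assignBlocks(Map<String, Double> capacity, long totalBlocks) {
            double total = capacity.values().stream().mapToDouble(Double::doubleValue).sum();
            Map<String, Long> plan = new LinkedHashMap<>();
            for (Map.Entry<String, Double> e : capacity.entrySet()) {
                // Faster nodes receive proportionally more blocks.
                plan.put(e.getKey(), Math.round(totalBlocks * e.getValue() / total));
            }
            return plan; // rounding can leave a small remainder; a real policy would rebalance
        }
    }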


Parameterizable benchmarking framework for designing a MapReduce performance model

In MapReduce environments, many applications have to achieve different performance goals for producing time-relevant results. A typical user question is how to estimate the completion time of a MapReduce program as a function of varying input dataset sizes and given cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We analyze t...
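Completion-time estimates of this kind are often built from simple per-phase bounds. As an illustration only (our assumption, not the model developed in the paper), the classic greedy-scheduling bounds for a phase of n independent tasks on k identical slots are n*avg/k at best and (n-1)*avg/k + max at worst:

    // Illustrative sketch: classic makespan bounds for a MapReduce phase of
    // n independent tasks on k identical slots under greedy scheduling.
    public class PhaseBounds {
        // Lower bound: perfect load balance, every slot busy until the end.
        static double lowerBound(long n, int k, double avgTaskSec) {
            return n * avgTaskSec / k;
        }
        // Upper bound: the longest task starts last, after the rest have been
        // spread evenly across the slots.
        static double upperBound(long n, int k, double avgTaskSec, double maxTaskSec) {
            return (n - 1) * avgTaskSec / k + maxTaskSec;
        }
        public static void main(String[] args) {
            // Hypothetical numbers: 2000 map tasks, 400 slots, 12 s average, 30 s max.
            System.out.printf("map phase: %.0f s .. %.0f s%n",
                    lowerBound(2000, 400, 12.0), upperBound(2000, 400, 12.0, 30.0));
        }
    }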


Some Workload Scheduling Alternatives in a High Performance Computing Environment

Clusters of commodity microprocessors have overtaken custom-designed systems as the high performance computing (HPC) platform of choice. The design and optimization of workload scheduling systems for clusters has been an active research area. This paper surveys some examples of workload scheduling methods used in large-scale operations such as Google, Yahoo, and Amazon that use a MapReduce pa...


Performance Impact of Data Locality in MapReduce on Hadoop

As the foundation for MapReduce processing, Hadoop is one of the fundamental technologies in big data analytics. Hadoop breaks up large data into data blocks, replicates them, and stores them in a distributed storage system. Data blocks can be placed in a machine where the data will be processed (data local), in a machine in the same rack (rack-local), or in a machine in a different rack (off-r...
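Which of these locality classes a task hits depends on where HDFS actually placed the blocks, and that placement can be inspected directly. A minimal sketch using Hadoop's public FileSystem API (the file path is a hypothetical command-line argument):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Print, for each block of an HDFS file, the hosts holding a replica.
    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path(args[0]);
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }

Comparing these hosts against where the scheduler launched each task distinguishes data-local, rack-local, and off-rack execution.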


An Efficient Approach to Optimize the Performance of Massive Small Files in Hadoop MapReduce Framework

The most popular open-source distributed computing framework, Hadoop, was designed by Doug Cutting and his team; it involves thousands of nodes to process and analyze huge amounts of data, called Big Data. The major core components of Hadoop are HDFS (Hadoop Distributed File System) and MapReduce. This framework is the most popular and powerful for storing, managing and processing Big Data appl...
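A common mitigation for the small-files problem, a standard Hadoop technique rather than necessarily the approach proposed in this paper, is to pack many small files into a single SequenceFile keyed by filename, so one map task can stream through many logical files. A minimal sketch:

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack local small files (args[1..]) into one SequenceFile (args[0]),
    // keyed by filename, to avoid one HDFS block and one map task per file.
    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(args[0])),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (int i = 1; i < args.length; i++) {
                    byte[] data = Files.readAllBytes(new File(args[i]).toPath());
                    writer.append(new Text(args[i]), new BytesWritable(data));
                }
            }
        }
    }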




Journal title:

Volume   Issue

Pages  -

Publication date: 2015