An intermediate data placement algorithm for load balancing in Spark computing environment

Authors

Zhuo Tang

Xiangshen Zhang

Kenli Li

Keqin Li

Abstract

Since MapReduce became an effective and popular programming framework for parallel data processing, key skew in intermediate data has become one of the important system performance bottlenecks. To address the load imbalance of bucket containers in the shuffle process of the Spark computing framework, this paper proposes a splitting and combination algorithm for skewed intermediate data blocks (SCID), which improves load balancing across reduce tasks. Because the number of keys cannot be counted until the input data have been processed by map tasks, the paper provides a sampling algorithm based on reservoir sampling to detect the distribution of keys in the intermediate data. In contrast to the original mechanism for bucket data loading, SCID sorts the data clusters of key/value tuples from each map task by size and fills them into the relevant buckets in order. A data cluster is split once it exceeds the residual volume of the current bucket; after that bucket is filled, the remainder of the cluster is carried into the next iteration. Through this processing, the total size of data in each bucket is kept roughly equal. Because each reduce task fetches its intermediate results from a specific bucket of each map task, equalizing the quantity of data across all buckets of a map task balances the load of the reduce tasks. We implement SCID in Spark 1.1.0 and evaluate its performance with three widely used benchmarks: Sort, Text Search, and Word Count. Experimental results show that our algorithms not only achieve higher overall average balancing performance but also reduce the execution time of jobs with varying degrees of data skew. © 2016 Elsevier B.V. All rights reserved.
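The two steps the abstract describes, sampling keys to estimate cluster sizes and then filling buckets with splitting, can be illustrated with a minimal Python sketch. This is not the authors' Spark implementation. The function names, the descending sort order, the integer cluster sizes, and the choice of bucket capacity as the ceiling of total size divided by the number of buckets are all assumptions made for illustration. The sketch also leaves out how split clusters are later recombined.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample


def estimate_cluster_sizes(sampled_keys, total_records):
    """Scale sampled key counts up to the full data size (hypothetical helper)."""
    if not sampled_keys:
        return {}
    counts = Counter(sampled_keys)
    return {key: round(count * total_records / len(sampled_keys))
            for key, count in counts.items()}


def fill_buckets(cluster_sizes, num_buckets):
    """Place clusters into buckets, splitting a cluster at the bucket boundary.

    Assumptions: sizes are non-negative integers, clusters are taken
    largest first, and every bucket has capacity ceil(total / num_buckets).
    """
    total = sum(cluster_sizes.values())
    capacity = -(-total // num_buckets)  # ceiling division
    buckets = [[] for _ in range(num_buckets)]
    current, residual = 0, capacity
    ordered = sorted(cluster_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in ordered:
        remaining = size
        while remaining > 0:
            # Take as much of the cluster as the current bucket can hold.
            piece = min(remaining, residual)
            buckets[current].append((key, piece))
            remaining -= piece
            residual -= piece
            if residual == 0:
                # Bucket is full: the rest of the cluster goes to the next one.
                current += 1
                residual = capacity
    return buckets
```

For example, calling `fill_buckets({"a": 10, "b": 3, "c": 2, "d": 1}, 4)` splits the skewed cluster `a` across three buckets and leaves every bucket with a load of 4.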


Similar resources

GASA: Presentation of an Initiative Method Based on Genetic Algorithm for Task Scheduling in the Cloud Environment

With the advance of information technology, the need for computation has emerged everywhere and at any time. Cloud computing is the latest response to such needs, and cloud computing systems have recently become highly popular. Increasing cloud efficiency is an important subject of consideration. Heterogeneity and diversity among different resources and requests of users in ...


Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...


Optimized Algorithms for Virtual Machine Placement based on Multi-Dimensional Resource Characteristics in Cloud Computing Systems

Virtual machine placement to the PMs of the cloud datacenter is one of the important problems in cloud environment to provide better service to the cloud users. This research work proposed techniques to improve the performance of virtual machine placement in cloud environment. The proposed placement algorithm consisted of two main tasks. The first task optimizes the scheduling, while the second...


An Effective Task Scheduling Framework for Cloud Computing using NSGA-II

Cloud computing is a model for convenient on-demand user’s access to changeable and configurable computing resources such as networks, servers, storage, applications, and services with minimal management of resources and service provider interaction. Task scheduling is regarded as a fundamental issue in cloud computing which aims at distributing the load on the different resources of a distribu...


Journal title:

Future Generation Comp. Syst.

Volume 78

Publication date 2018
