A Data Skew Oriented Reduce Placement Algorithm Based on Sampling
Authors
Abstract
Because of frequent disk I/O and large data transmissions among different racks and physical nodes, intermediate data communication has become the most important performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes of clusters or racks to improve data locality. Because the number of keys cannot be counted until the input data are processed by map tasks, this paper applies a reservoir algorithm to sample the input data, so that the sampled distribution of keys/values approximates that of the original data. Based on the distribution matrix of the intermediate results in each partition, and by calculating the distance and cost matrices of cross-node communication, the related map and reduce tasks can be scheduled to relatively nearby physical nodes for data locality. We implement CORP in Hadoop 2.4.0 and evaluate its performance using three widely used benchmarks: Sort, Grep, and Join. In these experiments, an evaluation model is proposed for selecting appropriate sample rates, which comprehensively considers the importance of cost, effect, and variance in sampling. Experimental results show that CORP not only improves the balance of reduce tasks effectively but also decreases the job execution time because of lower internal data communication. Compared with some other reduce scheduling algorithms, the average data transmission of the entire system on the core switch is reduced substantially.
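The sampling step mentioned in the abstract can be illustrated with a minimal sketch of classic reservoir sampling (Algorithm R), followed by an estimate of each key's share of the data. This is not CORP's implementation: the function names `reservoir_sample` and `estimate_key_shares`, the `(key, value)` record layout, and the use of a fixed sample size `k` are assumptions made only for illustration.

```python
import random
from collections import Counter


def reservoir_sample(records, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Hypothetical helper: classic Algorithm R, which needs a single pass
    and does not need to know the input length in advance.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def estimate_key_shares(sample):
    """Estimate the fraction of records per key from (key, value) pairs."""
    counts = Counter(key for key, _ in sample)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Illustrative skewed input: key "a" appears far more often than the others.
records = [("a", i) for i in range(800)] + [("b", i) for i in range(150)] \
    + [("c", i) for i in range(50)]
sample = reservoir_sample(records, k=100, seed=42)
shares = estimate_key_shares(sample)
```

The estimated shares could then feed a per-partition distribution estimate before any map task has finished, which is the role the abstract assigns to sampling. How CORP chooses its sample rate is covered by the paper's evaluation model and is not reproduced here.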
Similar resources
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
An intermediate data placement algorithm for load balancing in Spark computing environment
Since MapReduce became an effective and popular programming framework for parallel data processing, key skew in intermediate data has become one of the important system performance bottlenecks. For solving the load imbalance of bucket containers in the shuffle process of the Spark computing framework, this paper proposes a splitting and combination algorithm for skew intermediate data blocks (S...
An Extension of the Birnbaum-Saunders Distribution Based on Skew-Normal t Distribution
In this paper, we introduce a family of univariate Birnbaum-Saunders distributions arising from the skew-normal-t distribution. We obtain several properties of this distribution such as its moments, the maximum likelihood estimation procedure via an EM-algorithm and a method to evaluate standard errors using the EM-algorithm. Finally, we apply these methods to a real data set to demonstr...
Multi Objective Optimization Placement of DG Problem for Different Load Levels on Distribution Systems with Purpose Reduction Loss, Cost and Improving Voltage Profile Based on DAPSO Algorithm
Along with the economic growth of countries, which leads to increased energy requirements, the problem of power quality and reliability of the networks has received more attention, and in recent decades we have witnessed a noticeable growing trend of distributed generation sources (DG) in distribution networks. Occurrence of DG in distribution systems, in addition to changing the utilization of these sys...
A Routing-Aware Simulated Annealing-based Placement Method in Wireless Network on Chips
Wireless network on chip (WiNoC) is one of the promising on-chip interconnection networks for on-chip system architectures. In addition to wired links, these architectures also use wireless links. Using these wireless links makes packets reach destination nodes faster and with less power consumption. These wireless links are provided by wireless interfaces in wireless routers. The WiNoC archite...