Phase–Reconfigurable Shuffle Optimization for Hadoop MapReduce

Hadoop MapReduce is a leading open-source framework that supports the Big Data revolution and serves as a pioneering platform for storing and processing ultra-large volumes of information. However, tuning a MapReduce system has become difficult because a large number of parameters constrain its performance, many of which relate to shuffle, a complicated phase between the map and reduce functions that includes sorting, grouping, and HTTP transfer. During the shuffle phase, a large amount of time is consumed by disk I/O with low data throughput. In this paper, we build a mathematical model to estimate the computational complexities of the different operating orders within the map-side shuffle, so that faster execution can be achieved by reconfiguring the order of sorting and grouping. Furthermore, we construct a three-dimensional performance exploration space, in which features sampled during the shuffle stage, such as the number of keys, the number of spill files, and the variance of the intermediate results, are collected to support the evaluation of the computational complexity of each operating order. Thus, an optimized reconfiguration of the map-side shuffle architecture can be achieved within Hadoop without inducing extra disk I/O. Compared with the original Hadoop implementation, the results show that our reconfigurable architecture achieves up to a 2.37X speedup in finishing the map-side shuffle work.
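As an illustrative sketch only (the symbols n, k, and the constants c_i below are assumptions for exposition, not the paper's actual cost model), the intuition behind comparing the two operating orders can be stated as follows: let n denote the number of intermediate records emitted by a map task and k the number of distinct keys observed by sampling. Sorting all records before grouping scales with n log n, whereas grouping first (e.g., by hashing on keys) and then ordering only the grouped keys scales with the smaller key count:

\[
C_{\text{sort-then-group}} \;\approx\; c_1\, n \log n + c_2\, n,
\qquad
C_{\text{group-then-sort}} \;\approx\; c_3\, n + c_4\, k \log k .
\]

Under these assumed forms, grouping before sorting is expected to be cheaper whenever k is much smaller than n, which is the kind of trade-off the sampled shuffle features (key count, spill-file count, and variance of intermediate results) are used to evaluate at run time.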