Harp: Collective Communication on Hadoop

Harp: Collective Communication on Hadoop Big data processing tools have evolved rapidly in recent years. MapReduce has proven very successful but is not optimized for many important analytics, especially those involving iteration. In this regard, Iterative MapReduce frameworks improve performance of MapReduce job chains through caching. Further, Pregel, Giraph and Graph Lab abstract data as a graph and process it in iterations. But all these tools are designed with a fixed data abstraction and have limited collective communication support to synchronize application data and algorithm control states among parallel processes. In this paper, we introduce a collective communication abstraction layer which provides efficient collective communication operations on several common data abstractions such as arrays, key-values and graphs, and define a Map Collective programming model which serves the diverse collective communication demands in different parallel algorithms. We implement a library called Harp to provide the features above and plug it into Hadoop so that applications abstracted in Map Collective model can be easily developed on top of MapReduce framework and conveniently integrated with other tools in Apache Big Data Stack. With improved expressiveness in the abstraction and excellent performance on the implementation, we can simultaneously support various applications from HPC to Cloud systems together with high performance.