AIR: Adaptive Index Replacement in Hadoop

The Hadoop Distributed File System (HDFS) has become the de facto standard for storing large datasets in data management systems such as Hadoop MapReduce, Hive, and Stratosphere. Although HDFS was originally designed to support scan-oriented operations, several techniques have recently been developed to allow for efficient indexing in HDFS. One of these techniques is aggressive indexing, i.e. HDFS replicas are indexed immediately at upload time, before touching any disk, creating multiple clustered indexes almost for free along the way. A second technique is adaptive indexing, i.e. HDFS blocks are indexed on demand only, as a side effect of query processing. Although these techniques provide impressive speed-ups in query processing, they have so far ignored the cost of storing a large number of replicas of a dataset. The HDFS variants of adaptive indexing were designed to leverage the natural redundancy of HDFS, which typically stores a dataset three times anyway. However, it is questionable whether storing an unlimited number of replicas of a dataset is practical. Therefore, this paper is the first to analyze adaptive indexing under a space constraint, i.e. we assume that indexes are adaptively created and deleted. We coin this problem the Adaptive Index Replacement (AIR) problem. We present a new algorithm, LeastExpectedBenefit-K (LEB-K), to solve the online AIR problem and compare it against several existing state-of-the-art online index selection algorithms in a comprehensive study of ten algorithms. Our results show that LEB-2 is efficient and robust and hence a good choice in practice.
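The abstract only names the LEB-K algorithm, so the following minimal Java sketch illustrates one plausible shape of a least-expected-benefit replacement policy. All identifiers here (LebKPolicy, recordBenefit, pickVictim) are hypothetical, and the interpretation of K as the length of a per-index benefit history is an assumption for illustration, not the paper's definition.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a LeastExpectedBenefit-K (LEB-K) style policy.
 * Assumption: the "expected benefit" of an index is the average of the
 * K most recent per-query time savings observed for it, and the index
 * with the lowest average is evicted when the space budget is exceeded.
 */
public class LebKPolicy {
    private final int k;          // history window size (the K in LEB-K)
    private final int maxIndexes; // space constraint: max indexes kept
    private final Map<String, Deque<Double>> benefits = new HashMap<>();

    public LebKPolicy(int k, int maxIndexes) {
        this.k = k;
        this.maxIndexes = maxIndexes;
    }

    /** Record the time saved by using the index on blockId for one query. */
    public void recordBenefit(String blockId, double savedSeconds) {
        Deque<Double> history =
            benefits.computeIfAbsent(blockId, id -> new ArrayDeque<>());
        if (history.size() == k) {
            history.removeFirst(); // keep only the K most recent observations
        }
        history.addLast(savedSeconds);
    }

    /** Called before creating a new index; returns the victim to drop, if any. */
    public String pickVictim() {
        if (benefits.size() < maxIndexes) {
            return null;           // still within the space budget
        }
        String victim = null;
        double least = Double.MAX_VALUE;
        for (Map.Entry<String, Deque<Double>> e : benefits.entrySet()) {
            double expected = e.getValue().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
            if (expected < least) {
                least = expected;
                victim = e.getKey();
            }
        }
        return victim;
    }

    /** Drop the victim's bookkeeping once its index has been deleted. */
    public void evict(String blockId) {
        benefits.remove(blockId);
    }
}
```

Under this reading, LEB-2 would simply be the instance with k = 2, i.e. the expected benefit of an index is estimated from its two most recent uses.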