HM: A Column-Oriented MapReduce System on Hybrid Storage

The solid-state hybrid drive (SSHD) incorporates a small NAND flash memory into a hard drive, resulting in an integrated device that combines HDD (Hard Disk Drive) and SSD (Solid State Disk) storage. By identifying the data most critical to performance and buffering it in the SSD part, an SSHD can deliver better performance than a standard hard drive. However, exploiting this capability requires a significant redesign of existing data processing systems. In this paper, we examine the problem of efficiently processing relational data using MapReduce on a cluster that uses SSHDs as the underlying storage devices. We present the design of HM (Hybrid MapReduce), a column-oriented MapReduce system whose storage layout, query optimizer, data index, and compression algorithm differ from those of previous MapReduce systems. In HM, the DFS (Distributed File System) is deployed on SSHDs, and the data layout (how data chunks are distributed between HDDs and SSDs) plays a key role in performance. Hence, an approximate algorithm is used to tune the data layout adaptively to maximize query performance. We evaluate HM using the TPC-H benchmark, and the results show that, with our new design, the hybrid system can provide performance similar to that of an SSD-only system.
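
To make the data layout problem concrete, the following minimal sketch places column chunks on the SSD part under a capacity limit. It is not HM's algorithm, which the abstract does not specify; it substitutes a simple greedy heuristic that ranks chunks by accesses per megabyte. The `Chunk` type, the `place_chunks` function, the access counts, and the chunk names are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A column chunk (hypothetical model, not HM's actual data structure)."""
    name: str
    size_mb: int   # chunk size in MB, assumed to be positive
    accesses: int  # assumed: number of reads observed in a recent window


def place_chunks(chunks, ssd_capacity_mb):
    """Split chunks into (ssd, hdd) name lists with a greedy density heuristic.

    Chunks are ranked by accesses per MB; each chunk goes to the SSD part
    if it still fits in the remaining capacity, otherwise to the HDD part.
    """
    ranked = sorted(chunks, key=lambda c: c.accesses / c.size_mb, reverse=True)
    ssd, hdd, used_mb = [], [], 0
    for chunk in ranked:
        if used_mb + chunk.size_mb <= ssd_capacity_mb:
            ssd.append(chunk.name)
            used_mb += chunk.size_mb
        else:
            hdd.append(chunk.name)
    return ssd, hdd


# Illustrative input only: sizes and access counts are made up.
chunks = [
    Chunk("lineitem.l_quantity", 64, 900),
    Chunk("lineitem.l_comment", 256, 50),
    Chunk("orders.o_orderdate", 32, 400),
    Chunk("part.p_name", 128, 300),
]
ssd, hdd = place_chunks(chunks, ssd_capacity_mb=128)
print("SSD:", ssd)
print("HDD:", hdd)
```

Rerunning such a placement as access counts change is one simple way to make a layout adaptive. A greedy ranking is fast but not optimal, since the underlying selection problem resembles a knapsack problem.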