Optimising Bootstrapping Algorithms Using R and Hadoop

A key research problem in machine learning and statistics today is feature (variable) selection when the number of samples is small relative to the number of features. Resampling methods such as the bootstrap are used in this context to mimic the availability of multiple datasets by resampling from a single dataset. On the one hand, some resampling-based algorithms, such as Bolasso, have been shown to decrease error as the number of bootstrap replicates increases. On the other hand, dataset sizes are expected to grow in most research domains. There is therefore a demand for many algorithm runs over many data replicates and, given the expected growth in dataset size, high-performance parallel optimisation becomes mandatory. In this paper, we introduce an efficient data distribution and load-balanced parallel computation for the Bolasso algorithm based on R and HDFS. We study its performance on a large dataset consisting of 300 samples and 10000 features. The performance evaluation shows that the new R on HDFS approach, implemented with Snowfall and RHDFS, outperforms the conventional algorithm on Linux EXT4. We conclude that R on HDFS holds great promise for methods based on resampling or bootstrapping, in particular when increasing the number of dataset replicates decreases the algorithm's error, as demonstrated in the performance evaluation of this paper.
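To make the resampling idea concrete, the following is a minimal sketch of Bolasso-style selection: fit a Lasso on each bootstrap replicate and keep only the features selected in every replicate. It is not the R, Snowfall and RHDFS implementation evaluated in this paper. It is written in plain Python, a thread pool stands in for the cluster workers, the Lasso solver is a basic coordinate descent, and all function names, the penalty `lam` and the synthetic data are assumptions made for illustration only.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bootstrap_support(X, y, lam, seed):
    """Draw one bootstrap replicate (rows sampled with replacement) and
    return the indices of the features with non-zero Lasso coefficients."""
    rng = random.Random(seed)
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
    return {j for j, b in enumerate(beta) if b != 0.0}


def bolasso(X, y, lam, n_boot=8, workers=4):
    """Run the replicates independently and intersect their supports."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        supports = list(
            pool.map(lambda s: bootstrap_support(X, y, lam, s), range(n_boot))
        )
    return set.intersection(*supports)


# Synthetic data (an assumption for this sketch): 80 samples, 6 features,
# of which only features 0 and 3 influence the response.
data_rng = random.Random(0)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(80)]
y = [3.0 * row[0] - 2.0 * row[3] + data_rng.gauss(0.0, 0.1) for row in X]

selected = bolasso(X, y, lam=0.5)
print(sorted(selected))
```

Because each replicate depends only on the shared dataset and its own seed, the replicates can be distributed across workers without communication until the final intersection, which is the property that makes this family of methods a natural fit for parallel execution.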