Preprocessing web logs: A critical phase in web usage mining

Web usage mining is the discovery of user access patterns from the web logs of a website. Raw web logs are highly unstructured, which makes them unsuitable for direct mining. They therefore pass through a preprocessing stage that makes them suitable for analysis and also reduces the file size significantly. This paper explores the preprocessing phase in detail and proposes a "total and absolute" tool for it, which removes irrelevant and noisy data and transforms the remainder into a form that can be readily used for analysis. The tool is called total and absolute because, after cleaning the data, it displays summary statistics for the preprocessed records. The summary statistics report the number of records fed as input, the number of elements obtained after preprocessing, and the time taken to complete the task. Finally, the tool exports the preprocessed data to a .log file that can be easily imported into any data mining utility. The summary statistics and data export features distinguish this tool from those proposed earlier.
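
The pipeline outlined above (clean, summarize, export) can be illustrated with a minimal Python sketch. It is not the proposed tool. The abstract does not name a log format or list the cleaning rules, so the sketch assumes Common Log Format input and uses hypothetical rules: drop unparseable lines, requests for images, stylesheets and scripts, and requests whose status code is not 2xx. The names `CLF_PATTERN`, `IRRELEVANT_SUFFIXES`, `is_relevant`, `preprocess`, `export_log` and `SAMPLE` are all illustrative.

```python
import re
import time

# Assumption: input lines follow the Common Log Format.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these resource types count as noise.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_relevant(line):
    """Return True if a log line survives the (hypothetical) cleaning rules."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return False  # unparseable line
    path = match.group("url").split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_SUFFIXES):
        return False  # embedded resource, not a page view
    return match.group("status").startswith("2")  # keep successful requests only


def preprocess(lines):
    """Clean the records and return them together with summary statistics."""
    start = time.perf_counter()
    records = [line.strip() for line in lines if line.strip()]
    cleaned = [record for record in records if is_relevant(record)]
    summary = {
        "input_records": len(records),
        "output_records": len(cleaned),
        "elapsed_seconds": time.perf_counter() - start,
    }
    return cleaned, summary


def export_log(records, path):
    """Write the cleaned records to a .log file, one record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(record + "\n")


# Made-up sample records, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2020:13:55:37 +0000] "GET /logo.png HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2020:13:56:01 +0000] "GET /missing.html HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    cleaned, summary = preprocess(SAMPLE)
    print(summary)
    export_log(cleaned, "preprocessed.log")
```

On the three sample records, only the request for `/index.html` passes the filters; the image request and the 404 response are discarded. The summary dictionary holds the same three quantities the abstract lists: records fed as input, records remaining after preprocessing, and elapsed time.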