Combining Some Preprocessing Operations and Algorithm Selection with the Help of Metalearning

Pavel Brazdil¹

¹LIAAD InescTec

Users of machine learning algorithms need methods that help them identify the algorithms, or combinations of operations and algorithms (workflows), that are likely to achieve the best performance. Here we focus on a special case of workflows that combine some preprocessing methods with a set of classification algorithms. We have conducted two studies.

In the first study, each workflow combined a preprocessing operation, either CFS feature selection or a null operation, with one of a set of 38 classification algorithms. The aim was to identify workflows with good performance while minimizing the testing time.

The second study was oriented towards text classification, which typically involves a chain of preprocessing operations followed by the application of a classification algorithm. As this problem may involve a large number of tests and it is not feasible to explore all combinations, we focused the study on a subset of all possible workflows. For now, these include two classification algorithms: SVMs and Random Forests. The preprocessing tasks include stemming, sparsity correction and stop-word removal. Different workflows were run on different datasets and ranked based on both accuracy and time.
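To give a sense of how quickly the workflow space grows, the following minimal sketch enumerates candidate workflows under simplifying assumptions that are not stated in the abstract: each of the three preprocessing steps is either switched on or off, the steps are applied in a fixed order, and each chain ends in one of the two classifiers. The names `PREPROCESSING_STEPS`, `CLASSIFIERS` and `enumerate_workflows` are hypothetical and used for illustration only; the study itself considered only a subset of the possible workflows.

```python
from itertools import product

# Hypothetical labels for the operations named in the abstract.
PREPROCESSING_STEPS = ["stemming", "sparsity_correction", "stopword_removal"]
CLASSIFIERS = ["SVM", "RandomForest"]


def enumerate_workflows(steps, classifiers):
    """Return every (enabled_steps, classifier) pair.

    Assumes each step is simply on or off and the order of steps is fixed.
    """
    workflows = []
    for flags in product([False, True], repeat=len(steps)):
        enabled = tuple(step for step, on in zip(steps, flags) if on)
        for classifier in classifiers:
            workflows.append((enabled, classifier))
    return workflows


workflows = enumerate_workflows(PREPROCESSING_STEPS, CLASSIFIERS)
print(len(workflows))  # 2**3 on/off choices times 2 classifiers = 16
```

Even this small grid yields 16 workflows per dataset, and adding parameter settings for each operation would multiply that number further.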

Our study focuses on two algorithm selection methods, average ranking (AR) and active testing (AT), both of which exploit test results obtained on prior datasets. To evaluate our proposal, we have carried out extensive experiments in leave-one-out mode. The results show that both AR and AT were able to select workflows that are likely to lead to the best results.
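The average ranking idea can be illustrated with a minimal sketch: workflows are ranked on each prior dataset, the ranks are averaged, and the workflows are recommended for a new dataset in order of mean rank. The sketch below makes several assumptions that are not taken from the abstract: the scores are invented toy values standing in for some combined accuracy and time measure, the workflow labels are hypothetical, and tied scores share the average of their rank positions. Active testing is not shown.

```python
def rank_scores(scores):
    """Map each workflow to its rank (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of workflows tied with position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared_rank
        i = j + 1
    return ranks


def average_ranking(results, held_out):
    """Order workflows by mean rank over all datasets except `held_out`."""
    prior = [d for d in results if d != held_out]
    workflows = list(results[prior[0]])
    total = {w: 0.0 for w in workflows}
    for dataset in prior:
        for workflow, rank in rank_scores(results[dataset]).items():
            total[workflow] += rank
    mean_rank = {w: total[w] / len(prior) for w in workflows}
    return sorted(workflows, key=lambda w: mean_rank[w])


# Toy scores (higher is better); values are illustrative, not experimental results.
results = {
    "d1": {"stem+SVM": 0.90, "SVM": 0.85, "stem+RF": 0.80},
    "d2": {"stem+SVM": 0.70, "SVM": 0.75, "stem+RF": 0.60},
    "d3": {"stem+SVM": 0.88, "SVM": 0.80, "stem+RF": 0.82},
    "d4": {"stem+SVM": 0.91, "SVM": 0.89, "stem+RF": 0.70},
}

# Leave-one-out: build the recommendation for each dataset from the others.
for dataset in results:
    print(dataset, average_ranking(results, dataset))
```

In this toy data the recommended order for `d2` starts with `stem+SVM`, although `SVM` actually scores best there, which shows why the recommended ranking is followed down the list rather than trusted at its first entry.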

keywords: document classification, text mining, preprocessing, algorithm selection, metalearning.

Poster: Combining Some Preprocessing Operations and Algorithm Selection with the Help of Metalearning