Users of machine learning algorithms need methods that help them identify algorithms, or combinations of algorithms (workflows), that achieve the best possible performance. Here we focus on a special case of workflows that combine preprocessing methods with classification algorithms. We have conducted two studies.
In the first study, each workflow combined one of two preprocessing operations (CFS feature selection or a null operation) with one of 38 classification algorithms. The aim was to identify workflows with good performance while minimizing the testing time.
The second study was oriented towards text classification, which typically involves a chain of preprocessing operations followed by the application of a classification algorithm. As this problem may involve a large number of tests, making it infeasible to explore all combinations, we focused the study on a subset of all possible workflows. This subset covers, for now, two classification algorithms (SVMs and Random Forests) and three preprocessing tasks (stemming, sparsity correction and stop-word removal). The workflows were run on different datasets and ranked on both accuracy and time.
Our study examines two algorithm selection methods: average ranking (AR) and active testing (AT), both of which exploit test results obtained on prior datasets. To evaluate our proposal we carried out extensive experiments in a leave-one-out mode. The results show that both AR and AT were able to intelligently select the workflow that is likely to lead to the best results.
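The average ranking idea can be illustrated with a minimal sketch: rank the workflows on each prior dataset by accuracy, average the ranks, and recommend the workflow with the best (lowest) average rank. The workflow names and accuracy figures below are purely illustrative assumptions, not results from the studies, and ties are left unhandled for brevity.

```python
from collections import defaultdict

def average_ranking(results):
    """results: dict mapping dataset name -> {workflow: accuracy}.

    Returns the workflows sorted by average rank across all
    prior datasets, best (lowest average rank) first.
    """
    rank_sums = defaultdict(float)
    for scores in results.values():
        # Rank 1 = highest accuracy on this dataset.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, workflow in enumerate(ordered, start=1):
            rank_sums[workflow] += rank
    n_datasets = len(results)
    return sorted(rank_sums, key=lambda wf: rank_sums[wf] / n_datasets)

# Hypothetical prior test results (illustrative numbers only):
prior = {
    "dataset1": {"svm+stem": 0.91, "rf+stem": 0.88, "svm": 0.85},
    "dataset2": {"svm+stem": 0.80, "rf+stem": 0.84, "svm": 0.78},
    "dataset3": {"svm+stem": 0.89, "rf+stem": 0.83, "svm": 0.86},
}
print(average_ranking(prior))  # ['svm+stem', 'rf+stem', 'svm']
```

In a leave-one-out evaluation, one dataset is held out in turn, the ranking is built from the remaining datasets, and the recommended workflow is then tested on the held-out one.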
Keywords: document classification, text mining, preprocessing, algorithm selection, metalearning.