Heterogeneous Ensembles for the Missing Feature Problem

[abstract] Missing values are ubiquitous in real-world datasets. In this work, we show how to handle them with heterogeneous ensembles of classifiers that outperform state-of-the-art solutions. Several approaches are compared across a number of different datasets. Some state-of-the-art classifiers (e.g., SVM and RotBoost) are first tested, coupled with the Expectation-Maximization (EM) imputation method. Then, the classifiers are combined to build ensembles. Using the Wilcoxon signed-rank test (rejecting the null hypothesis at the 0.05 significance level), we show that our best heterogeneous ensembles, obtained by combining a forest of decision trees (a method that does not require any dataset-specific tuning) with a tuned SVM, outperform two dataset-tuned solutions based on LibSVM, the most widely used SVM toolbox: a stand-alone SVM classifier and a random subspace of SVMs. The same ensembles also exhibit better performance than a recent cluster-based imputation method for handling missing values, which has been shown [27] to outperform several state-of-the-art imputation approaches, when both the training set and the test set contain 10% missing values.
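To make the pipeline concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation: scikit-learn's IterativeImputer stands in for EM imputation, a RandomForestClassifier stands in for the untuned decision-tree forest, an RBF-kernel SVC stands in for the tuned SVM, a soft-voting ensemble combines the two, and paired per-fold accuracies are compared with the Wilcoxon signed-rank test at the 0.05 level. All parameter values are illustrative.

```python
# Hedged sketch of the ensemble-with-imputation pipeline described above.
# Assumptions: IterativeImputer approximates EM imputation; hyperparameters
# (C, gamma, n_estimators) are placeholders, not the paper's tuned values.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with ~10% of entries removed, mimicking the missing-value setting.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X[rng.random(X.shape) < 0.10] = np.nan

# Stand-alone (tuned) SVM with imputation, the baseline being compared against.
svm = make_pipeline(IterativeImputer(random_state=0), StandardScaler(),
                    SVC(C=10, gamma="scale", probability=True, random_state=0))
# Forest of decision trees, used here without dataset-specific tuning.
forest = make_pipeline(IterativeImputer(random_state=0),
                       RandomForestClassifier(n_estimators=200, random_state=0))
# Heterogeneous ensemble: soft-voting combination of the forest and the SVM.
ensemble = VotingClassifier([("svm", svm), ("forest", forest)], voting="soft")

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc_svm = cross_val_score(svm, X, y, cv=cv)
acc_ens = cross_val_score(ensemble, X, y, cv=cv)

# Wilcoxon signed-rank test on the paired per-fold accuracies (alpha = 0.05).
stat, p = wilcoxon(acc_ens, acc_svm)
print(f"SVM: {acc_svm.mean():.3f}  ensemble: {acc_ens.mean():.3f}  p = {p:.3f}")
```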

Keywords: missing values; imputation methods; support vector machine; decision tree; ensemble of classifiers.

[Full Paper]