Abstract
A common issue in classification, both in scientific research and in industry, is
class imbalance. When the class sample sizes in the training data are
imbalanced, naively applying a classification method often yields
unsatisfactory predictions on test data. Multiple resampling techniques have
been proposed to address class imbalance, yet there is no general guidance on
when to use each technique. In this article, we
provide an objective-oriented review of the common resampling techniques for
binary classification under imbalanced class sizes. The learning objectives we
consider include the classical paradigm, which minimizes the overall
classification error; the cost-sensitive learning paradigm, which minimizes a
cost-adjusted weighted combination of the type I and type II errors; and the
Neyman-Pearson paradigm, which minimizes the type II error subject to a
constraint on the type I error.
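As a rough illustration, each of the three objectives can be computed directly from the empirical type I error (misclassifying class 0 as class 1) and type II error (misclassifying class 1 as class 0). The sketch below is not from the article; the cost values `c1`, `c2` and the type I error bound `alpha` are hypothetical placeholders:

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Empirical type I error P(predict 1 | true 0) and
    type II error P(predict 0 | true 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    type1 = np.mean(y_pred[y_true == 0] == 1)
    type2 = np.mean(y_pred[y_true == 1] == 0)
    return type1, type2

def classical_error(y_true, y_pred):
    # Classical paradigm: overall misclassification rate.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def cost_sensitive_error(y_true, y_pred, c1=1.0, c2=5.0):
    # Cost-sensitive paradigm: c1, c2 are hypothetical misclassification
    # costs (assumptions for illustration, not values from the article).
    t1, t2 = error_metrics(y_true, y_pred)
    return c1 * t1 + c2 * t2

def np_feasible(y_true, y_pred, alpha=0.05):
    # Neyman-Pearson paradigm: a classifier is admissible only if its
    # type I error stays below the (hypothetical) level alpha; among
    # feasible classifiers, one then minimizes the type II error.
    t1, _ = error_metrics(y_true, y_pred)
    return t1 <= alpha
```

Under the classical paradigm one compares classifiers by `classical_error`; under the other two, the comparison changes to `cost_sensitive_error` or to the type II error among `np_feasible` classifiers, which is why the best resampling technique can differ across paradigms.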
Under each paradigm, we investigate the combination of the resampling
techniques and a few state-of-the-art classification methods. For each pair of
resampling techniques and classification methods, we use simulation studies to
study the performance under different evaluation metrics. From these extensive
simulation experiments, we demonstrate, under each classification paradigm, the
complex interplay among resampling techniques, base classification methods,
evaluation metrics, and imbalance ratios. The take-away message for
practitioners is that, with imbalanced data, one should usually consider all
combinations of resampling techniques and base classification methods.
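A minimal sketch of this recommendation: balance the training data with each candidate resampler, fit each base classifier on the result, and compare under the chosen metric. The resamplers below are generic random over- and under-sampling, and the nearest-centroid classifier is a toy stand-in for the state-of-the-art methods the article studies:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate samples of the smaller classes until all classes match
    the majority class size."""
    X, y = np.asarray(X), np.asarray(y)
    idx_by_class = [np.flatnonzero(y == c) for c in np.unique(y)]
    n_max = max(idx.size for idx in idx_by_class)
    keep = np.concatenate(
        [np.concatenate([idx, rng.choice(idx, size=n_max - idx.size)])
         for idx in idx_by_class])
    return X[keep], y[keep]

def random_undersample(X, y):
    """Drop samples of the larger classes until all classes match the
    minority class size."""
    X, y = np.asarray(X), np.asarray(y)
    idx_by_class = [np.flatnonzero(y == c) for c in np.unique(y)]
    n_min = min(idx.size for idx in idx_by_class)
    keep = np.concatenate(
        [rng.choice(idx, size=n_min, replace=False) for idx in idx_by_class])
    return X[keep], y[keep]

class NearestCentroid:
    """Toy base classifier: predict the class with the closest mean."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.vstack([X[y == c].mean(axis=0)
                                     for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

# Try every (resampler, classifier) pair, as the abstract recommends;
# in practice one would add more resamplers (e.g. SMOTE) and stronger
# base classifiers, and score with the paradigm's own metric.
resamplers = {"none": lambda X, y: (X, y),
              "oversample": random_oversample,
              "undersample": random_undersample}
classifiers = {"nearest_centroid": NearestCentroid}
```

A usage loop would then call `Xr, yr = resampler(X_train, y_train)` for each entry, fit each classifier on `(Xr, yr)`, and keep the pair that scores best on held-out data under the relevant evaluation metric.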