Abstract
A central problem in machine learning is identifying a representative set of features from
which to construct a classification model for a particular task. This thesis addresses the
problem of feature selection for machine learning through a correlation-based approach.
The central hypothesis is that good feature sets contain features that are highly correlated
with the class, yet uncorrelated with each other. A feature evaluation formula, based
on ideas from test theory, provides an operational definition of this hypothesis. CFS
(Correlation-based Feature Selection) is an algorithm that couples this evaluation formula
with an appropriate correlation measure and a heuristic search strategy.
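As a rough illustration of the hypothesis above, the sketch below scores a feature subset with the test-theory style merit k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. It makes several assumptions that do not come from this abstract: absolute Pearson correlation stands in for the "appropriate correlation measure", plain greedy forward selection stands in for the heuristic search strategy, and the names pearson, merit, and forward_select are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def merit(features, target, subset):
    """Merit of a subset: high when features track the class but not each other.

    `features` maps a feature name to its column of values (an assumed layout).
    """
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = list(combinations(subset, 2))
    r_ff = (
        sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(features, target):
    """Greedy forward selection (an assumed, simplified search): add the feature
    that most improves the merit, and stop when no addition improves it."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        candidate, candidate_merit = None, best
        for f in sorted(remaining):
            m = merit(features, target, selected + [f])
            if m > candidate_merit + 1e-12:
                candidate, candidate_merit = f, m
        if candidate is None:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_merit
    return selected, best


if __name__ == "__main__":
    target = [0, 0, 1, 1, 0, 1, 1, 0]
    features = {
        "a": list(target),              # perfectly predictive
        "b": list(target),              # redundant copy of "a"
        "c": [1, 0, 1, 0, 1, 0, 1, 0],  # uncorrelated with the class
    }
    print(forward_select(features, target))
```

In this toy data the redundant copy adds nothing to the merit and the uncorrelated feature lowers it, so the search keeps a single feature. This is only meant to show how the formula trades relevance against redundancy, not how CFS itself is implemented.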
CFS was evaluated by experiments on artificial and natural datasets. Three machine
learning algorithms were used: C4.5 (a decision tree learner), IB1 (an instance-based learner),
and naive Bayes. Experiments on artificial datasets showed that CFS quickly identifies
and screens irrelevant, redundant, and noisy features, and identifies relevant features as
long as their relevance does not strongly depend on other features. On natural domains,
CFS typically eliminated well over half the features. In most cases, classification accuracy
using the reduced feature set equaled or bettered accuracy using the complete feature set.
Feature selection degraded machine learning performance in cases where CFS eliminated
features that were highly predictive of very small areas of the instance space.
Further experiments compared CFS with a wrapper, a well-known approach to feature
selection that employs the target learning algorithm to evaluate feature sets. In many cases
CFS gave comparable results to the wrapper and, in general, outperformed the wrapper
on small datasets. CFS executes many times faster than the wrapper, which allows it to
scale to larger datasets.
Two methods of extending CFS to handle feature interaction are presented and
experimentally evaluated. The first considers pairs of features and the second incorporates
feature weights calculated by the RELIEF algorithm. Experiments on artificial domains
showed that both methods were able to identify interacting features. On natural domains,
the pairwise method gave more reliable results than using weights provided by RELIEF.