Abstract

The low cost of digital communication has given rise to the problem of email spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for automatically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state of the art results for the standard online filtering task.We explore a variety of feature representations for spam data. We reduce human labeling cost through the use of efficient online active learning variants. We give practical solutions to the one-sided feedback scenario, in which users only give labeling feedback on messages predicted to be non-spam. We investigate the impact of class label noise on machine learning-based spam filters, showing that previous benchmark evaluations rewarded filters prone to overfitting in real-world settings and proposing several modifications for combating these negative effects. Finally,we investigate the performance of these filtering methods on the more challenging task of abuse filtering in blog comments. Together, these contributions enable more accurate spam filters to be deployed in real-world settings, with greater robustness to noise, lower computation cost and lower human labeling cost. ****** Active learning methods seek to reduce the number of labeled examples needed to train an effective classifier, and have natural appeal in spam filtering applications where trustworthy labels for messages may be costly to acquire. Past investigations of active learning in spam filtering have focused on the pool-based scenario, where there is assumed to be a large, unlabeled data set and the goal is to iteratively identify the best subset of examples for which to request labels. However, even with optimizations this is a costly approach. We investigate an online active learning scenario where the filter is exposed to a stream of messages which must be classified one at a time. The filter may only request a label for a given message immediately after it has been classified. The goal is to achieve strong online classification performance with few label requests. This is a novel scenario for low-cost active spam filtering, fitting for application in large-scale systems. We draw from the label efficient machine learning literature to investigate several approaches to selective sampling in this scenario using linear classifiers. We show that online active learning can dramatically reduce labeling and training costs with negligible additional overhead while maintaining high levels of classification performance.

Links and resources

Tags