G. Hulten, L. Spencer, and P. Domingos. Proceedings of the 7th ACM SIGKDD International conference on Knowledge discovery and data mining, page 97--106. ACM Press, (2001)
Abstract
Most statistical and machine-learning algorithms assume that the data
is a random sample drawn from a stationary
distribution. Unfortunately, most of the large databases available for
mining today violate this assumption. They were gathered over months
or years, and the underlying processes generating them changed during
this time, sometimes radically. Although a number of algorithms have
been proposed for learning time-changing concepts, they generally do
not scale well to very large databases. In this paper we propose an
efficient algorithm for mining decision trees from
continuously-changing data streams, based on the ultra-fast VFDT
decision tree learner. This algorithm, called CVFDT, stays current
while making the most of old data by growing an alternative subtree
whenever an old one becomes questionable, and replacing the old with
the new when the new becomes more accurate. CVFDT learns a model which
is similar in accuracy to the one that would be learned by reapplying
VFDT to a moving window of examples every time a new example arrives,
but with O(1) complexity per example, as opposed to O(w), where w is
the size of the window. Experiments on a set of large time-changing
data streams demonstrate the utility of this approach.
%0 Conference Paper
%1 KDD01:Hulten
%A Hulten, Geoff
%A Spencer, Laurie
%A Domingos, Pedro
%B Proceedings of the 7th ACM SIGKDD International conference on Knowledge discovery and data mining
%D 2001
%I ACM Press
%K imported
%P 97--106
%T Mining time-changing data streams
%X Most statistical and machine-learning algorithms assume that the data
is a random sample drawn from a stationary
distribution. Unfortunately, most of the large databases available for
mining today violate this assumption. They were gathered over months
or years, and the underlying processes generating them changed during
this time, sometimes radically. Although a number of algorithms have
been proposed for learning time-changing concepts, they generally do
not scale well to very large databases. In this paper we propose an
efficient algorithm for mining decision trees from
continuously-changing data streams, based on the ultra-fast VFDT
decision tree learner. This algorithm, called CVFDT, stays current
while making the most of old data by growing an alternative subtree
whenever an old one becomes questionable, and replacing the old with
the new when the new becomes more accurate. CVFDT learns a model which
is similar in accuracy to the one that would be learned by reapplying
VFDT to a moving window of examples every time a new example arrives,
but with O(1) complexity per example, as opposed to O(w), where w is
the size of the window. Experiments on a set of large time-changing
data streams demonstrate the utility of this approach.
%@ 1-58113-391-X
@inproceedings{KDD01:Hulten,
abstract = {
Most statistical and machine-learning algorithms assume that the data
is a random sample drawn from a stationary
distribution. Unfortunately, most of the large databases available for
mining today violate this assumption. They were gathered over months
or years, and the underlying processes generating them changed during
this time, sometimes radically. Although a number of algorithms have
been proposed for learning time-changing concepts, they generally do
not scale well to very large databases. In this paper we propose an
efficient algorithm for mining decision trees from
continuously-changing data streams, based on the ultra-fast VFDT
decision tree learner. This algorithm, called CVFDT, stays current
while making the most of old data by growing an alternative subtree
whenever an old one becomes questionable, and replacing the old with
the new when the new becomes more accurate. CVFDT learns a model which
is similar in accuracy to the one that would be learned by reapplying
VFDT to a moving window of examples every time a new example arrives,
but with O(1) complexity per example, as opposed to O(w), where w is
the size of the window. Experiments on a set of large time-changing
data streams demonstrate the utility of this approach.},
added-at = {2008-04-30T12:59:47.000+0200},
author = {Hulten, Geoff and Spencer, Laurie and Domingos, Pedro},
biburl = {https://www.bibsonomy.org/bibtex/2d1f61cf8ec12ac2e734ea7feb3e91611/kdubiq},
booktitle = {Proceedings of the 7th ACM SIGKDD International conference on Knowledge discovery and data mining},
description = {KDubiq Blueprint},
interhash = {6f9e0b7ff453e9a74a64ce8e96b880e5},
intrahash = {d1f61cf8ec12ac2e734ea7feb3e91611},
isbn = {1-58113-391-X},
keywords = {imported},
location = {San Francisco, California},
pages = {97--106},
publisher = {ACM Press},
timestamp = {2008-04-30T12:59:56.000+0200},
title = {Mining time-changing data streams},
year = 2001
}