Large labeled training sets are the critical building blocks of supervised
learning methods and are key enablers of deep learning techniques. For some
applications, creating labeled training sets is the most time-consuming and
expensive part of applying machine learning. We therefore propose a paradigm
for the programmatic creation of training sets called data programming in which
users provide a set of labeling functions, which are programs that
heuristically label large subsets of data points, albeit noisily. By viewing
these labeling functions as implicitly describing a generative model for this
noise, we show that we can recover the parameters of this model to "denoise"
the training set. Then, we show how to modify a discriminative loss function to
make it noise-aware. We demonstrate our method on a range of discriminative
models, including logistic regression and LSTMs. We establish theoretically that
we can recover the parameters of these generative models in a handful of
settings. Experimentally, on the 2014 TAC-KBP relation extraction challenge, we
show that data programming would have obtained a winning score, and also show
that applying data programming to an LSTM model leads to a TAC-KBP score almost
6 F1 points above a supervised LSTM baseline (and to second place in the
competition). Additionally, in initial user studies we observed that data
programming may make it easier for non-experts to create machine learning
models.
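
To make the workflow concrete, below is a minimal sketch, in Python with
NumPy, of the two ingredients the abstract names: heuristic labeling
functions and a noise-aware discriminative loss. It is an illustration under
assumed names, not the authors' implementation: the spouse-extraction task,
the labeling functions lf_spouse_keyword and lf_sibling_keyword, and
noise_aware_logistic_loss are all hypothetical.

# Minimal sketch (hypothetical, not the paper's code): labeling functions
# vote +1, -1, or abstain on each data point; a generative model over their
# agreements and disagreements would then estimate p_pos, the probability
# that each point's true label is +1, which the noise-aware loss consumes
# in place of hard gold labels.
import numpy as np

ABSTAIN, NEG, POS = 0, -1, 1

def lf_spouse_keyword(sentence):
    # Heuristic: a marriage cue suggests the spouse relation holds.
    return POS if "married" in sentence.lower() else ABSTAIN

def lf_sibling_keyword(sentence):
    # Heuristic: sibling cues contradict the spouse relation.
    s = sentence.lower()
    return NEG if ("brother" in s or "sister" in s) else ABSTAIN

def noise_aware_logistic_loss(w, X, p_pos):
    # Expected logistic loss over the estimated label distribution:
    # each example contributes its loss-if-positive weighted by p_pos
    # plus its loss-if-negative weighted by (1 - p_pos).
    scores = X @ w
    loss_if_pos = np.log1p(np.exp(-scores))  # loss incurred if y = +1
    loss_if_neg = np.log1p(np.exp(scores))   # loss incurred if y = -1
    return np.mean(p_pos * loss_if_pos + (1.0 - p_pos) * loss_if_neg)

The design point the abstract highlights is visible in the last function:
the discriminative loss takes an expectation over the denoised label
probabilities rather than trusting any single heuristic's vote.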
Data Programming: Creating Large Training Sets, Quickly
@misc{ratner2016programming,
author = {Ratner, Alexander and De Sa, Christopher and Wu, Sen and Selsam, Daniel and Ré, Christopher},
keywords = {data deep_learning},
note = {arXiv:1605.07723},
title = {Data Programming: Creating Large Training Sets, Quickly},
url = {http://arxiv.org/abs/1605.07723},
year = 2016
}