In recent years, the real-world impact of machine learning (ML) has grown in leaps and bounds. In large part, this is due to the advent of deep learning models, which allow practitioners to get state-of-the-art scores on benchmark datasets without any hand-engineered features. Given the availability of multiple open-source ML frameworks like TensorFlow and PyTorch, and an abundance of available state-of-the-art models, it can be argued that high-quality ML models are almost a commoditized resource now. There is a hidden catch, however: the reliance of these models on massive sets of hand-labeled training data.
These hand-labeled training sets are expensive and time-consuming to create — often requiring person-months or years to assemble, clean, and debug — especially when domain expertise is required. On top of this, tasks often change and evolve in the real world. For example, labeling guidelines, granularities, or downstream use cases often change, necessitating re-labeling (e.g., instead of classifying reviews only as positive or negative, introducing a neutral category). For all these reasons, practitioners have increasingly been turning to weaker forms of supervision, such as heuristically generating training data with external knowledge bases, patterns/rules, or other classifiers. Essentially, these are all ways of programmatically generating training data—or, more succinctly, programming training data.
We begin by reviewing areas of ML that are motivated by the problem of labeling training data, and then describe our research on modeling and integrating a diverse set of supervision sources. We also discuss our vision for building data management systems for the massively multi-task regime with tens or hundreds of weakly supervised dynamic tasks interacting in complex and varied ways. Check out the our research blog for detailed discussions of these topics and more!