Abstract
Crowdsourcing has been the prevalent paradigm for creating natural language
understanding datasets in recent years. A common crowdsourcing practice is to
recruit a small number of high-quality workers, and have them massively
generate examples. Having only a few workers generate the majority of examples
raises concerns about data diversity, especially when workers freely generate
sentences. In this paper, we perform a series of experiments showing these
concerns are evident in three recent NLP datasets. We show that model
performance improves when training with annotator identifiers as features, and
that models are able to recognize the most productive annotators. Moreover, we
show that often models do not generalize well to examples from annotators that
did not contribute to the training set. Our findings suggest that annotator
bias should be monitored during dataset creation, and that test set annotators
should be disjoint from training set annotators.
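To make the two methodological points concrete, the following is a minimal sketch (not the authors' code) of (1) adding the annotator identifier as an input feature and (2) splitting the data so that test set annotators are disjoint from training set annotators. The dataset, column names ("text", "label", "annotator_id"), and the bag-of-words classifier are hypothetical stand-ins for illustration only.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for a crowdsourced dataset.
df = pd.DataFrame({
    "text": ["a dog runs", "a cat sleeps", "birds fly", "fish swim",
             "the sun shines", "rain falls"],
    "label": [0, 1, 0, 1, 0, 1],
    "annotator_id": ["w1", "w1", "w2", "w2", "w3", "w3"],
})

# (2) Annotator-disjoint split: no annotator appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["annotator_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# (1) Annotator ID as an extra feature, concatenated with the text features.
vec = TfidfVectorizer().fit(train["text"])
enc = OneHotEncoder(handle_unknown="ignore").fit(train[["annotator_id"]])
X_train = hstack([vec.transform(train["text"]),
                  enc.transform(train[["annotator_id"]])])
X_test = hstack([vec.transform(test["text"]),
                 enc.transform(test[["annotator_id"]])])

clf = LogisticRegression().fit(X_train, train["label"])
print("accuracy on unseen annotators:", clf.score(X_test, test["label"]))
```

Because the split is grouped by annotator, the score above reflects generalization to annotators that did not contribute training examples, which is the evaluation setting the abstract recommends.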