Inproceedings

Evaluating dataset creation heuristics for concept detection in web pages using BERT

Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management, volume 12816 of Lecture Notes in Artificial Intelligence, pages 1--14. Springer, 2021.
DOI: 10.1007/978-3-030-82147-0_14

Abstract

Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web's unstructured and heterogeneous nature and the cost of annotation. In this situation, annotation heuristics can be employed to create a dataset that captures the target concept, but may in turn lead to unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models' performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset-specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets from which the classifier is to learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages to assess the influence of the individuals' occupation and the web page domain.
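The abstract describes fine-tuning BERT as a binary classifier over heuristically annotated web-page text. As an illustration only (not the authors' code), a minimal sketch using the Hugging Face Transformers library might look as follows; the variables train_texts/train_labels and all hyperparameters are placeholder assumptions:

    # Minimal sketch: fine-tune BERT to label web pages as centered
    # around an individual (1) or not (0). Data and settings are
    # illustrative assumptions, not taken from the paper.
    import torch
    from torch.utils.data import Dataset
    from transformers import (BertTokenizerFast, BertForSequenceClassification,
                              Trainer, TrainingArguments)

    class WebPageDataset(Dataset):
        """Pairs extracted web-page text with a binary label."""
        def __init__(self, texts, labels, tokenizer, max_length=512):
            self.enc = tokenizer(texts, truncation=True,
                                 padding="max_length", max_length=max_length)
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # train_texts/train_labels would come from one of the heuristically
    # annotated datasets (hypothetical placeholders here).
    train_ds = WebPageDataset(train_texts, train_labels, tokenizer)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
    )
    trainer.train()

Training such a classifier on subsets of increasing size would then yield the learning-curve comparison across heuristics that the study reports.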

Users

  • @jaeschke
  • @tobias.koopmann
  • @dblp
