Inproceedings

Evaluating dataset creation heuristics for concept detection in web pages using BERT

Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management, volume 12816 of Lecture Notes in Artificial Intelligence, pages 1--14. Springer, 2021.
DOI: 10.1007/978-3-030-82147-0_14

Abstract

Dataset creation for the purpose of training natural language processing (NLP) algorithms is often accompanied by uncertainty about how the target concept is represented in the data. Extracting such data from web pages and verifying its quality is a non-trivial task, due to the Web's unstructured and heterogeneous nature and the cost of annotation. In this situation, annotation heuristics can be employed to create a dataset that captures the target concept, but may in turn lead to unstable downstream performance. On the one hand, a trade-off exists between cost, quality, and magnitude for annotation heuristics in tasks such as classification, leading to fluctuations in trained models' performance. On the other hand, general-purpose NLP tools like BERT are now commonly used to benchmark new models on a range of tasks on static datasets. We utilize this standardization as a means to assess dataset quality, as most applications are dataset-specific. In this study, we investigate and evaluate the performance of three annotation heuristics for a classification task on extracted web data using BERT. We present multiple datasets from which the classifier is to learn to identify web pages that are centered around an individual in the academic domain. In addition, we assess the relationship between the performance of the trained classifier and the training data size. The models are further tested on out-of-domain web pages to assess the influence of the individuals' occupation and the web page domain.
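The abstract describes fine-tuning BERT as a binary classifier over heuristically annotated web-page text. As an illustration only (not the authors' code), a minimal sketch using the Hugging Face Transformers library might look as follows; the variables train_texts/train_labels and all hyperparameters are placeholder assumptions:

    # Minimal sketch: fine-tune BERT to label web pages as centered
    # around an individual (1) or not (0). Data and settings are
    # illustrative assumptions, not taken from the paper.
    import torch
    from torch.utils.data import Dataset
    from transformers import (BertTokenizerFast, BertForSequenceClassification,
                              Trainer, TrainingArguments)

    class WebPageDataset(Dataset):
        """Pairs extracted web-page text with a binary label."""
        def __init__(self, texts, labels, tokenizer, max_length=512):
            self.enc = tokenizer(texts, truncation=True,
                                 padding="max_length", max_length=max_length)
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # train_texts/train_labels would come from one of the heuristically
    # annotated datasets (hypothetical placeholders here).
    train_ds = WebPageDataset(train_texts, train_labels, tokenizer)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=train_ds,
    )
    trainer.train()

Training such a classifier on subsets of increasing size would then yield the learning-curve comparison across heuristics that the study reports.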

Users

  • @jaeschke
  • @tobias.koopmann
  • @dblp
