
How to Assess the Exhaustiveness of Longitudinal Web Archives: A Case Study of the German Academic Web

Proceedings of the 31st ACM Conference on Hypertext and Social Media, New York, NY, USA, ACM, 2020.
DOI: 10.1145/3372923.3404836

Abstract

Longitudinal web archives can be a foundation for investigating structural and content-based research questions. One prerequisite is that they contain a faithful representation of the relevant subset of the web. An assessment of the authority of a given data set with respect to a research question should therefore precede the actual investigation. Besides proper creation and curation, this requires measures for estimating the potential of a longitudinal web archive to yield information about the central objects that the research question aims to investigate. Content-based research questions, in particular, often lack ab initio confidence in the integrity of the data. In this paper we focus on one particularly important aspect, namely the exhaustiveness of the data set with respect to the central objects: we investigate the recall coverage of researcher names in a longitudinal academic web crawl spanning seven years, and the influence of our crawl method on the integrity of the data set. Additionally, we propose a method to estimate the amount of missing information as a means of describing the exhaustiveness of the crawl, and we motivate a use case for the presented corpus.
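The recall-coverage idea from the abstract can be illustrated with a minimal sketch: given a reference list of known researcher names (e.g., drawn from a bibliography such as dblp) and the text of crawled pages, recall coverage is the fraction of known names that appear in at least one crawled page. The function name and the sample data below are illustrative assumptions, not the authors' implementation.

```python
def name_recall(known_names, crawled_texts):
    """Fraction of known researcher names found in at least one crawled page.

    This is a simplified substring match; a real pipeline would need
    normalization (casing, diacritics, name variants) before matching.
    """
    if not known_names:
        return 0.0
    found = {name for name in known_names
             if any(name in text for text in crawled_texts)}
    return len(found) / len(known_names)

# Illustrative stand-in data: two of three known names occur in the crawl.
known = ["Alice Example", "Bob Sample", "Carol Test"]
pages = ["Homepage of Alice Example", "Group: Bob Sample and students"]
print(name_recall(known, pages))  # 2 of 3 names found -> 0.666...
```

Computed per crawl snapshot, such a recall estimate can be tracked over the years of a longitudinal archive to assess how exhaustively the crawl covers the central objects of a research question.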
