
How to Assess the Exhaustiveness of Longitudinal Web Archives: A Case Study of the German Academic Web

Proceedings of the 31st ACM Conference on Hypertext and Social Media, New York, NY, USA, ACM, 2020.
DOI: 10.1145/3372923.3404836

Abstract

Longitudinal web archives can be a foundation for investigating structural and content-based research questions. One prerequisite is that they contain a faithful representation of the relevant subset of the web. An assessment of the authority of a given data set with respect to a research question should therefore precede the actual investigation. Besides proper creation and curation, this requires measures for estimating the potential of a longitudinal web archive to yield information about the central objects that the research question aims to investigate. Content-based research questions, in particular, often lack ab initio confidence in the integrity of the data. In this paper we focus on one particularly important aspect, namely the exhaustiveness of the data set with respect to the central objects: we investigate the recall coverage of researcher names in a longitudinal academic web crawl spanning seven years, and the influence of our crawl method on the integrity of the data set. Additionally, we propose a method to estimate the amount of missing information as a means of describing the exhaustiveness of the crawl, and we motivate a use case for the presented corpus.
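The recall-coverage idea from the abstract can be illustrated with a minimal sketch: given a reference list of known researcher names (e.g., drawn from a bibliography such as dblp) and the text of crawled pages, recall coverage is the fraction of known names that appear in at least one crawled page. The function name and the sample data below are illustrative assumptions, not the authors' implementation.

```python
def name_recall(known_names, crawled_texts):
    """Fraction of known researcher names found in at least one crawled page.

    This is a simplified substring match; a real pipeline would need
    normalization (casing, diacritics, name variants) before matching.
    """
    if not known_names:
        return 0.0
    found = {name for name in known_names
             if any(name in text for text in crawled_texts)}
    return len(found) / len(known_names)

# Illustrative stand-in data: two of three known names occur in the crawl.
known = ["Alice Example", "Bob Sample", "Carol Test"]
pages = ["Homepage of Alice Example", "Group: Bob Sample and students"]
print(name_recall(known, pages))  # 2 of 3 names found -> 0.666...
```

Computed per crawl snapshot, such a recall estimate can be tracked over the years of a longitudinal archive to assess how exhaustively the crawl covers the central objects of a research question.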
