Analyzing the Web: Are Top Websites Lists a Good Choice for Research?
T. Alby and R. Jäschke. Proceedings of the International Conference on Theory and Practice of Digital Libraries, pages 11--25. Cham, Springer, (2022)
DOI: 10.1007/978-3-031-16802-4_2
Abstract
The web has been a subject of research since its beginning, but it is difficult if not impossible to analyze the whole web, even if a database of all URLs would be freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer only little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph while also taking language-specific requirements into account.
%0 Conference Paper
%1 alby2022analyzing
%A Alby, Tom
%A Jäschke, Robert
%B Proceedings of the International Conference on Theory and Practice of Digital Libraries
%C Cham
%D 2022
%I Springer
%K 2022 alexa archive commoncrawl crawl myown research science tpdl web
%P 11--25
%R 10.1007/978-3-031-16802-4_2
%T Analyzing the Web: Are Top Websites Lists a Good Choice for Research?
%U https://link.springer.com/chapter/10.1007/978-3-031-16802-4_2
%X The web has been a subject of research since its beginning, but it is difficult if not impossible to analyze the whole web, even if a database of all URLs would be freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer only little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph while also taking language-specific requirements into account.
@inproceedings{alby2022analyzing,
abstract = {The web has been a subject of research since its beginning, but it is difficult if not impossible to analyze the whole web, even if a database of all URLs would be freely accessible. Hundreds of studies have used commercial top websites lists as a shortcut, in particular the Alexa One Million Top Sites list. However, apart from the fact that Amazon decided to terminate Alexa, we question the usefulness of such lists for research as they have several shortcomings. Our analysis shows that top sites lists miss frequently visited websites and offer only little value for language-specific research. We present a heuristic-driven alternative based on the Common Crawl host-level web graph while also taking language-specific requirements into account.},
added-at = {2022-07-09T10:18:02.000+0200},
address = {Cham},
author = {Alby, Tom and Jäschke, Robert},
biburl = {https://www.bibsonomy.org/bibtex/204c0c55f66d6b4ce9ec17318c4b4e70e/jaeschke},
booktitle = {Proceedings of the International Conference on Theory and Practice of Digital Libraries},
doi = {10.1007/978-3-031-16802-4_2},
interhash = {7fb83f826e70519ac62cd6a0fccc140c},
intrahash = {04c0c55f66d6b4ce9ec17318c4b4e70e},
keywords = {2022 alexa archive commoncrawl crawl myown research science tpdl web},
pages = {11--25},
publisher = {Springer},
series = {TPDL '22},
timestamp = {2023-01-11T15:08:29.000+0100},
title = {Analyzing the Web: Are Top Websites Lists a Good Choice for Research?},
url = {https://link.springer.com/chapter/10.1007/978-3-031-16802-4_2},
year = 2022
}