@inproceedings{cho:02:parallel,
abstract = {In this paper we study how we can design an effective
parallel crawler. As the size of the Web grows, it
becomes imperative to parallelize a crawling process,
in order to finish downloading pages in a reasonable
amount of time. We first propose multiple architectures
for a parallel crawler and identify fundamental issues
related to parallel crawling. Based on this
understanding, we then propose metrics to evaluate a
parallel crawler, and compare the proposed
architectures using 40 million pages collected from the
Web. Our results clarify the relative merits of each
architecture and provide a good guideline on when to
adopt which architecture.},
address = {Honolulu, Hawaii},
author = {Cho, J. and Garcia-Molina, H.},
  booktitle = {Proceedings of the 11th World Wide Web Conference},
title = {Parallel Crawlers},
year = 2002
}