Abstract
The Heritrix web crawler aims to be the world's first open source,
extensible, web-scale, archival-quality web crawler. It has, however, been
limited to a single crawling strategy: snapshot crawling. This paper reports on
work to add the ability to conduct incremental crawls. We
first discuss the concept of incremental crawling as opposed to snapshot
crawling and then the possible ways to design an effective incremental strategy.
We give an overview of our implementation and discuss its strengths and
limitations. We then report on the results of initial experimentation with the
new software, which went well. Finally, we discuss issues that remain
unresolved and possible future improvements.