Introduction to Heritrix, an archival quality web crawler

Abstract

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. The Internet Archive started Heritrix development in the early part of 2003. The intention was to develop a crawler for the specific purpose of archiving websites and to support multiple different use cases including focused and broad crawling. The software is open source to encourage collaboration and joint development across institutions with similar needs. A pluggable, extensible architecture facilitates customization and outside contribution. Now, after over a year of development, the Internet Archive and other institutions are using Heritrix to perform focused and increasingly broad crawls.

BibTeX key: mohr2004introduction
entry type: inproceedings
address: Bath, UK
booktitle: Proceedings of the 4th International Web Archiving Workshop IWAW'04
year: 2004
month: jul
Document: http://crawler.archive.org/Mohr-et-al-2004.pdf
