Abstract
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality
web crawler project. The Internet Archive started Heritrix development in early 2003.
The intention was to develop a crawler for the specific purpose of archiving
websites and to support multiple use cases, including focused and broad crawling.
The software is open source to encourage collaboration and joint development across
institutions with similar needs. A pluggable, extensible architecture facilitates
customization and outside contribution. Now, after more than a year of development, the
Internet Archive and other institutions are using Heritrix to perform focused and
increasingly broad crawls.