This is the public wiki for the Heritrix archival crawler project. Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix, or misspelled or mis-said as heratrix/heritix/heretix/heratix) is an archaic word for heiress (woman who inherits).
The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web.
More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far the project provides 11 different data set releases extracted from the Common Crawls 2010 to 2022. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
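As a rough illustration of what embedded structured data looks like to an extractor, the following minimal sketch pulls JSON-LD blocks out of an HTML page using only the Python standard library. It is not the Web Data Commons extraction pipeline; the class name `JsonLdExtractor`, the helper `extract_jsonld`, and the sample page are hypothetical and exist only for this example. Microdata, RDFa, and Microformats are spread across element attributes and would need a different approach.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> blocks.

    Hypothetical helper for illustration; not part of any project's API.
    """

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The attribute value may be None when the attribute has no value.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page with one product description.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Hello</p></body></html>
"""

items = extract_jsonld(SAMPLE_HTML)
for item in items:
    print(item.get("@type"), item.get("name"))
```

Malformed blocks are skipped on purpose: at web scale, a single broken page should not stop the extraction run.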
2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web based on a large crawl of the Web. In particular, we look at what forms of syntax and vocabularies publishers are using to mark up data inside HTML pages. We also describe the process that we have followed and the difficulties involved in web data extraction.
A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. Proceedings of the 3rd European Semantic Web Conference, volume 4011 of LNCS, pages 411-426. Budva, Montenegro, Springer, (June 2006)
M. Shokouhi, P. Chubak, and Z. Raeesy. International Conference on Information Technology: Coding and Computing (ITCC 2005), volume 2, pages 503-508. (2005)
N. Pahal, N. Chauhan, and A. Sharma. Third International Conference on Wireless Communication and Sensor Networks (WCSN '07), pages 121-124. (2007)
B. Fetahu, U. Gadiraju, and S. Dietze. Proceedings of the ISWC 2014 Posters & Demonstrations Track, a track within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, pages 433-436. (2014)