A growing number of websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project has published 11 data set releases extracted from the Common Crawl corpora from 2010 to 2022. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
P. Heinisch, A. Dulny, A. Krause, and A. Hotho. Workshop on Neuro-Explicit AI and Expert-Informed Machine Learning for Engineering and Physical Sciences at the ECML PKDD 2023 (2023). arXiv:2306.14511.