A growing number of websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project has published 11 data set releases extracted from the Common Crawl corpora of 2010 to 2022. It provides the extracted data for download and publishes statistics about the deployment of the different formats.
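To illustrate what embedded structured data looks like in practice, here is a minimal sketch that pulls JSON-LD blocks out of an HTML page using only the Python standard library. It is not the Web Data Commons extraction pipeline; the class `JsonLdExtractor`, the function `extract_jsonld`, and the sample page are hypothetical names and data made up for this example, and the other formats (Microdata, RDFa, Microformats) would need different handling.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Hypothetical helper: collects parsed <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, so fall back to "".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def extract_jsonld(html):
    """Return a list of decoded JSON-LD objects found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page describing a product with schema.org vocabulary.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Example page</p></body></html>
"""

if __name__ == "__main__":
    for item in extract_jsonld(SAMPLE_HTML):
        print(item.get("@type"), item.get("name"))
```

Malformed JSON blocks are skipped silently here, a choice made for this sketch because markup found on real pages is often invalid.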
With the Web serving as a huge worldwide data repository, issues related to data semantics (familiar to database modelers since the 1970s) have again become of paramount importance. As Web data comes from heterogeneous, possibly ...
T. Groza, S. Handschuh, K. Möller, and S. Decker. Proceedings of the 5th European Semantic Web Conference, Berlin, Heidelberg, Springer Verlag, (June 2008)
P. Lyngbaek and V. Vianu. Proceedings of the 12th Annual ACM Conference on the Management of Data, pages 132--142. San Francisco, California, (May 1987)