More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project provides 11 data set releases extracted from the Common Crawl corpora of 2010 to 2022. The project offers the extracted data for download and publishes statistics about the deployment of the different formats.
The Web Data Commons project extracts structured data from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web.
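To make the idea of extracting embedded structured data concrete, here is a minimal sketch that pulls JSON-LD blocks out of an HTML page using only the Python standard library. It is not the Web Data Commons extraction framework; the class `JsonLdExtractor`, the function `extract_jsonld`, and the sample page are hypothetical names and data chosen for illustration, and the sketch handles only JSON-LD, not Microdata, RDFa, or Microformats.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks.

    Hypothetical helper for illustration; not part of any real extraction tool.
    """

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def extract_jsonld(html):
    """Return a list with one parsed JSON value per JSON-LD block in `html`."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page, used only to exercise the sketch.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Hello</p></body></html>
"""

items = extract_jsonld(SAMPLE_HTML)
print(items[0]["@type"], items[0]["name"])
```

Malformed blocks are skipped silently here because pages in a web-scale crawl often contain invalid markup; a real pipeline would more likely log and count such failures.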
The Web is designed to support flexible exploration of information by human users and by automated agents. For such exploration to be productive, information published by many different sources and for a variety of purposes must be comprehensible to a wide range of Web client software, and to users of that software.
HTTP and other Web technologies can be used to deploy resource representations that are self-describing: information about the encodings used for each representation is provided explicitly within the representation. Starting with a URI, there is a standard algorithm that a user agent can apply to retrieve and interpret such representations. Furthermore, representations can be what we refer to as grounded in the Web, by ensuring that specifications required to interpret them are determined unambiguously based on the URI, and that explicit references connect the pertinent specifications to each other. Web-grounding ensures that the specifications needed to interpret information on the Web can be identified unambiguously. When such self-describing, Web-grounded resources are linked together, the Web as a whole can support reliable, ad hoc discovery of information.
This finding describes how document formats, markup conventions, attribute values, and other data formats can be designed to facilitate the deployment of self-describing, Web-grounded Web content.
Tim Berners-Lee
Date: 2007-10-23, last change: $Date: 2021/11/01 10:16:02 $
Status: personal view only. Editing status: draft. Written in response to another round of circular discussions of web architecture.
RDFa is an extension to HTML5 that helps you mark up things like People, Places, Events, Recipes and Reviews. Search engines and web services use this markup to generate better search listings and give you better visibility on the Web, so that people can find your website more easily.
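As a rough illustration of how a consumer might read such markup, the sketch below collects `property` attributes and their values from an HTML fragment using the Python standard library. It is deliberately not a conforming RDFa processor: it ignores `vocab`, `typeof`, prefixes, nested properties, and IRI-valued attributes such as `href`. The class `RdfaPropertyCollector` and the sample fragment are hypothetical.

```python
from html.parser import HTMLParser


class RdfaPropertyCollector(HTMLParser):
    """Collect (property, value) pairs from elements carrying a `property` attribute.

    Simplified illustration only; real RDFa processing follows the W3C
    processing rules and produces RDF triples, which this sketch does not.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # State for the property element currently being read.

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open is not None:
            # Track same-name nesting so the matching end tag is found.
            if tag == self._open["tag"]:
                self._open["depth"] += 1
            return
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit content attribute supplies the value (e.g. on <meta>).
            self.pairs.append((prop, attrs["content"]))
        else:
            self._open = {"prop": prop, "tag": tag, "depth": 1, "text": []}

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open["depth"] -= 1
            if self._open["depth"] == 0:
                text = "".join(self._open["text"]).strip()
                self.pairs.append((self._open["prop"], text))
                self._open = None


# Made-up fragment describing a person with schema.org terms.
SAMPLE_FRAGMENT = """
<div vocab="https://schema.org/" typeof="Person">
  <span property="name">Alice Example</span>
  <meta property="jobTitle" content="Editor">
</div>
"""

collector = RdfaPropertyCollector()
collector.feed(SAMPLE_FRAGMENT)
collector.close()
print(collector.pairs)
```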
J. Choi, A. Khlif, and E. Epure. Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA), pp. 23--27. Online, Association for Computational Linguistics, (2020)
S. Staab, J. Lehmann, and R. Verborgh. Companion Proceedings of the The Web Conference 2018, pp. 885--886. Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, (2018)
D. Schlör, J. Pfister, and A. Hotho. 2023 the 7th International Conference on Medical and Health Informatics (ICMHI), pp. 136--141. New York, NY, USA, Association for Computing Machinery, (2023)
B. Cao, B. Plale, G. Subramanian, P. Missier, C. Goble, and Y. Simmhan. International Workshop on the role of Semantic Web in Provenance Management (SWPM), volume 526 of CEUR Workshop Proceedings, pp. 1--6. CEUR-WS.org, (October 2009)
V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, (2020)
A. Ngonga Ngomo, F. Conrads, M. Pensel, and A. Turhan. Proceedings of the 10th International Conference on Knowledge Capture, pp. 213--221. New York, NY, USA, Association for Computing Machinery, (2019)
P. Kolyvakis, A. Kalousis, and D. Kiritsis. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 787--798. New Orleans, Louisiana, Association for Computational Linguistics, (June 2018)
R. Türker, L. Zhang, M. Koutraki, and H. Sack. The Semantic Web - 16th International Conference, ESWC 2019, Portoroz, Slovenia, June 2-6, 2019, Proceedings, pp. 346--362. (2019)