2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web based on a large crawl of the Web. In particular, we look at what forms of syntax and vocabularies publishers are using to mark up data inside HTML pages. We also describe the process that we have followed and the difficulties involved in web data extraction.
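To make the idea of detecting markup syntaxes and vocabularies concrete, the following is a minimal sketch and not the method used in the paper. It assumes three common embedded syntaxes (RDFa, microdata, microformats) and a small, non-exhaustive set of attribute and class names as signals. The names `MarkupDetector` and `detect_markup` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative signals only (assumption): attribute names typical of each syntax.
RDFA_ATTRS = {"typeof", "property", "about", "vocab"}
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
# A few well-known microformat root class names (not exhaustive).
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hentry"}


class MarkupDetector(HTMLParser):
    """Records which metadata syntaxes and vocabulary URIs appear in a page."""

    def __init__(self):
        super().__init__()
        self.syntaxes = set()
        self.vocabularies = Counter()

    def handle_starttag(self, tag, attrs):
        # Boolean attributes such as itemscope arrive with a value of None.
        attr_map = dict(attrs)
        names = set(attr_map)
        if names & RDFA_ATTRS:
            self.syntaxes.add("rdfa")
        if names & MICRODATA_ATTRS:
            self.syntaxes.add("microdata")
        classes = set((attr_map.get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.syntaxes.add("microformat")
        # Count vocabulary or type URIs declared via itemtype or vocab.
        for key in ("itemtype", "vocab"):
            value = attr_map.get(key)
            if value:
                self.vocabularies[value] += 1


def detect_markup(html):
    """Return (set of syntaxes, Counter of vocabulary URIs) for one HTML page."""
    detector = MarkupDetector()
    detector.feed(html)
    return detector.syntaxes, detector.vocabularies
```

Aggregating the returned sets and counters over every page in a crawl would give per-syntax and per-vocabulary adoption counts. Real pages are often malformed, so a production extractor would need more robust parsing than this sketch provides.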
The Publishing Requirements for Industry Standard Metadata (PRISM) specification defines a standard for interoperable content description, interchange, and reuse in both traditional and electronic publishing contexts. PRISM recommends the use of certain existing standards, such as XML, RDF, the Dublin Core, and various ISO specifications for locations, languages, and date/time formats. Beyond those recommendations, it defines a small number of XML namespaces and controlled vocabularies of values, in order to meet the goals listed above.
ANSI/NISO Z39.50 (=ISO 23950: "Information Retrieval (Z39.50): Application Service Definition and Protocol Specification") specifies a client/server-based protocol for searching and retrieving information from remote databases.
CCSDS (Eds.) Recommendation for Standard, Consultative Committee for Space Data Systems, Office of Space Communication (Code M-3), NASA, Washington, DC 20546, USA, (May 2004)