2012. Metadata Statistics for a Large Web Corpus
ABSTRACT
We provide an analysis of the adoption of metadata standards on the Web, based on a large crawl of the Web. In particular, we look at which syntaxes and vocabularies publishers use to mark up data inside HTML pages. We also describe the process we followed and the difficulties involved in web data extraction.
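As an illustration of what counting markup syntaxes in HTML pages can look like, here is a minimal sketch in Python. It is not the method used in the paper. The marker sets (`MICRODATA_ATTRS`, `RDFA_ATTRS`, `MICROFORMAT_CLASSES`), the class `SyntaxDetector`, and the function `count_syntaxes` are hypothetical names chosen for this example, and the detection is a rough attribute-based heuristic that assumes each page is available as an HTML string.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed, deliberately incomplete marker sets for three embedded-metadata syntaxes.
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}
RDFA_ATTRS = {"typeof", "property", "vocab"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hreview", "hentry"}


class SyntaxDetector(HTMLParser):
    """Records which metadata syntaxes and vocabulary URIs appear in one page."""

    def __init__(self):
        super().__init__()
        self.syntaxes = set()
        self.vocabularies = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names; valueless attributes map to None.
        attr_map = dict(attrs)
        names = set(attr_map)
        if names & MICRODATA_ATTRS:
            self.syntaxes.add("microdata")
        if names & RDFA_ATTRS:
            self.syntaxes.add("rdfa")
        classes = set((attr_map.get("class") or "").split())
        if classes & MICROFORMAT_CLASSES:
            self.syntaxes.add("microformats")
        # Collect vocabulary URIs declared through itemtype (microdata) or vocab (RDFa).
        for key in ("itemtype", "vocab"):
            if attr_map.get(key):
                self.vocabularies.add(attr_map[key])


def count_syntaxes(pages):
    """Return a Counter mapping each syntax to the number of pages that use it."""
    counts = Counter()
    for html in pages:
        detector = SyntaxDetector()
        detector.feed(html)
        detector.close()
        counts.update(detector.syntaxes)
    return counts
```

Calling `count_syntaxes` on a list of HTML strings yields per-syntax page counts. A real extraction pipeline would also have to cope with malformed markup, character encodings, and ambiguous attributes, which this sketch does not attempt.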
The Resource Description and Access (RDA) standard, due to be released this coming summer, has included since May 2007 a parallel effort to build Semantic Web-enabled vocabularies. This article describes that effort and the decisions made to express the vocabularies for use within the library community and, in addition, as a bridge to the future of library data outside the current MARC-based systems. The authors also touch on the registration activities that have made the vocabularies usable independently of the RDA textual guidance. Designed for both human and machine users, the registered vocabularies describe the relationships between FRBR, the RDA classes and properties, and the extensive value vocabularies developed for use within RDA.
CrossRef is an independent membership association, founded and directed by publishers. CrossRef’s mandate is to connect users to primary research content, by enabling publishers to work collectively. CrossRef is also the official DOI® link registration agency for scholarly and professional publications. Our citation-linking network today covers tens of millions of articles and other content items from thousands of scholarly and professional publishers.
CCSDS (Eds.) Recommendation for Standard, Consultative Committee for Space Data Systems, Office of Space Communication (Code M-3), NASA, Washington, DC 20546, USA, (May 2004)