Large quantities of historical newspapers are being digitized and processed with OCR. We describe a framework for processing the OCR text to identify articles and extract their metadata. We describe the article schema and give examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. We also describe visualization and summarization techniques that can be used to present the extracted events.
In these Web 2.0-3.0 days, data publishers are widely expected to offer their data through APIs, but there is no clear, universal way to encode and query this data. There are many ways of encoding information structure and semantics
This document describes the differences between XML and RDF, and between DC elements, XML elements and RDF properties. It seeks to clarify the requirements that must be met before a "term" can be referenced in a Dublin Core Application Profile (DCAP).
vocabulary to describe a general data model for scholarly citations. It covers three primary classes: events, agents, and bibliographic reference types. It is designed to offer a solid general relational model for citation metadata, and also to provide a
As an XML schema, the "Metadata Object Description Schema" (MODS) is intended to be able to carry selected data from existing MARC 21 records as well as to enable the creation of original resource description records.
With MARCXML, an XML schema for the one-to-one transfer of MARC 21 records into an XML structure, the Library of Congress has responded to this development. MABxml is intended to fulfill a comparable function for MAB2.
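As a rough illustration of what a one-to-one transfer of a MARC 21 record into an XML structure can look like, the following Python sketch serializes a small in-memory record using the MARCXML element names (record, leader, controlfield, datafield, subfield) and the MARC21 slim namespace. The dictionary layout, the function name record_to_marcxml, and the sample values are assumptions made for this example only; they are not taken from any of the documents described above, and the sketch does not validate against the schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Library of Congress MARCXML ("slim") schema.
MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def record_to_marcxml(record):
    """Map a simple dict-based MARC record onto MARCXML elements.

    The input layout is a hypothetical one chosen for this sketch:
      leader:        str
      controlfields: list of (tag, value)
      datafields:    list of (tag, ind1, ind2, [(code, value), ...])
    """
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", MARCXML_NS)

    def q(name):
        # Build a namespace-qualified element name.
        return f"{{{MARCXML_NS}}}{name}"

    root = ET.Element(q("record"))
    ET.SubElement(root, q("leader")).text = record["leader"]

    # Control fields carry only a tag and a value.
    for tag, value in record["controlfields"]:
        ET.SubElement(root, q("controlfield"), {"tag": tag}).text = value

    # Data fields carry a tag, two indicators, and coded subfields.
    for tag, ind1, ind2, subfields in record["datafields"]:
        field = ET.SubElement(
            root, q("datafield"), {"tag": tag, "ind1": ind1, "ind2": ind2}
        )
        for code, value in subfields:
            ET.SubElement(field, q("subfield"), {"code": code}).text = value

    return root


# Illustrative sample data only, not a real catalogue record.
sample = {
    "leader": "00000nam a2200000 a 4500",
    "controlfields": [("001", "example-0001")],
    "datafields": [("245", "1", "0", [("a", "An example title")])],
}

root = record_to_marcxml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Each part of the input maps to exactly one element or attribute, which is the sense in which the transfer is one-to-one: no field is interpreted, merged, or renamed along the way.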