Large quantities of historical newspapers are being digitized and OCR'd. We describe a framework for processing the OCR'd text to identify articles and extract their metadata. We present the article schema and give examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
The SBB Zeitungen METS-Profil - Exchange describes the data format for exchanging metadata for digital objects of digitized newspapers between the Staatsbibliothek zu Berlin and third parties that create this data as contractors.
The <div> TYPE attribute vocabulary is a list of terms that may be used to categorise the core structural elements of an object in a METS document conforming to the Australian METS Profile. Examples of how these values may be applied are given in the Appendix – Content Models. The content models in the current version of the document represent use cases that have been tested by the Maintenance Agency; further content models and vocabulary terms will be added as they are developed.
P. Mika. Proceedings of the Workshop on Semantic Search (SemSearch 2008) at the 5th European Semantic Web Conference (ESWC 2008), June 2, 2008, Tenerife, Spain, volume 334 of CEUR Workshop Proceedings, CEUR-WS.org, (2008)