- Ex Libris - DigiTool multi-page entity
- DL Consulting Blog
- hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
- The purpose of this document is to define an open standard for representing OCR results. The goal is to reuse as much existing technology as possible, and to arrive at a representation that makes it easy to reuse OCR results.
- OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
- Large quantities of historical newspapers are being digitized and OCRd. We describe a framework for processing the OCRd text to identify articles and extract metadata for them. We describe the article schema and provide examples of features that facilitate automatic indexing of them. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
- Generates a METS file connecting image areas, OCRed text, and ground truth documents encoded in TEI XML.
- METS / ALTO technical information
- The National Library of Australia, in collaboration with the Australian State and Territory libraries, is creating a free online service that gives full-text searching of newspaper articles. This will include newspapers published in each state and territory from the 1800s to the mid-1950s, when copyright applies. The first Australian newspaper, published in Sydney in 1803, is included in the program. The Beta service contains 70,000 newspaper pages from 1803 onwards and additional pages are being added each week.
- METS / ALTO general information
- Digital Library Consulting Blog
- PowerPoint presentation from 2004 on METAe, METS, and ALTO
- ALTO Schema
- The National Library of Australia, in collaboration with the Australian State and Territory libraries, has commenced a program to digitise out-of-copyright newspapers.
- ALTO (Analyzed Layout and Text Object) is an extension schema to METS, describing the layout and content of e.g. single pages.

