OCRopus(tm) is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The DjVuLibre XML Tools allow editing of the metadata, hyperlinks, and hidden text associated with DjVu files. Unlike djvused(1), the DjVuLibre XML Tools rely on XML technology and can take advantage of XML editors and verifiers.
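As a rough illustration of why an XML representation is convenient, the sketch below reads and corrects a hidden-text layer with Python's standard XML parser. The sample document, its element names (HIDDENTEXT, LINE, WORD), and the coords attribute are assumptions modeled on a typical DjVu XML export, and the helper functions extract_words and replace_word are hypothetical; none of this is the DjVuLibre tools' own API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the hidden-text layer of a DjVu XML export.
# Element names and the coords attribute are assumptions for illustration.
SAMPLE = """<DjVuXML>
  <BODY>
    <OBJECT type="image/x.djvu" width="2550" height="3300">
      <HIDDENTEXT>
        <PAGECOLUMN><REGION><PARAGRAPH>
          <LINE>
            <WORD coords="100,200,180,170">Hello</WORD>
            <WORD coords="190,200,260,170">wrold</WORD>
          </LINE>
        </PARAGRAPH></REGION></PAGECOLUMN>
      </HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def extract_words(xml_text):
    """Return (text, coords) pairs for every WORD element in the document."""
    root = ET.fromstring(xml_text)
    return [(w.text or "", w.get("coords")) for w in root.iter("WORD")]


def replace_word(xml_text, old, new):
    """Return the XML with every WORD whose text equals `old` set to `new`."""
    root = ET.fromstring(xml_text)
    for w in root.iter("WORD"):
        if w.text == old:
            w.text = new
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(extract_words(SAMPLE))
    fixed = replace_word(SAMPLE, "wrold", "world")
    print(extract_words(fixed))
```

Because the hidden text is ordinary XML, any generic parser, editor, or validator can work on it in this way; writing the corrected text back into a DjVu file is left to the tools themselves.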
Large quantities of historical newspapers are being digitized and OCR'd. We describe a framework for processing the OCR'd text to identify articles and extract their metadata. We describe the article schema and give examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. Furthermore, we describe visualization and summarization techniques that can be used to present the extracted events.
The National Library of Australia, in collaboration with the Australian State and Territory libraries, is creating a free online service that provides full-text searching of newspaper articles. The service will include newspapers published in each state and territory from the 1800s to the mid-1950s, after which copyright applies. The first Australian newspaper, published in Sydney in 1803, is included in the program. The Beta service contains 70,000 newspaper pages from 1803 onwards, and additional pages are being added each week.