raleighpublicrecord/dochive · GitHub. DocHive has two prerequisites, ImageMagick and Tesseract. It converts PDF pages to images and then OCRs each image. Its purpose is to extract numeric statistical tables from PDFs for import into spreadsheets.
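The two-step flow described above (page to image, image to text) can be sketched roughly as follows. This is not DocHive's own code, only a minimal illustration that assumes the ImageMagick `convert` and `tesseract` command-line tools are on the PATH (newer ImageMagick releases prefer `magick`, and PDF input usually also needs Ghostscript). The function names, the 300 DPI setting, and the `page-NNNN` file naming are hypothetical choices.

```python
import subprocess
from pathlib import Path


def build_commands(pdf, page, workdir="."):
    """Return the two command lines for one zero-based PDF page."""
    stem = Path(workdir) / f"page-{page:04d}"  # hypothetical naming scheme
    image = f"{stem}.png"
    # ImageMagick: "[N]" selects a single page; -density sets the render DPI
    # and has to come before the input file to take effect.
    to_image = ["convert", "-density", "300", f"{pdf}[{page}]", image]
    # Tesseract writes the recognized text to "<output base>.txt".
    to_text = ["tesseract", image, str(stem)]
    return to_image, to_text


def ocr_page(pdf, page, workdir="."):
    """Rasterize one page, OCR it, and return the recognized text."""
    to_image, to_text = build_commands(pdf, page, workdir)
    subprocess.run(to_image, check=True)
    subprocess.run(to_text, check=True)
    return Path(to_text[2] + ".txt").read_text(encoding="utf-8")
```

Keeping the command construction separate from the `subprocess` calls makes it easy to inspect or log the exact commands before running them on a large batch of PDFs.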
OCRopus is an OCR system written in Python, NumPy, and SciPy that focuses on the use of large-scale machine learning to address problems in document analysis. Earlier versions used Tesseract for text recognition.
IMPACT is a Centre of Competence that makes digitisation of historical printed text in Europe faster, cheaper and better, and provides tools, services and facilities for further advancement of the State of the Art in this field.
ALTO (Analyzed Layout and Text Object) is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper. It most commonly serves as an extension schema used within the Metadata Encoding and Transmission Schema (METS) administrative metadata section. However, ALTO instances can also exist as standalone documents used independently of METS.
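As a rough illustration of reading a standalone ALTO file, the sketch below collects the `CONTENT` attribute of each `String` element inside every `TextLine`. The sample document is a heavily simplified, hand-written fragment (real files carry IDs, measurement units, and more attributes), and the helper names are hypothetical. Because the namespace URI differs between ALTO versions, the code matches on local element names only.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample; not a complete or schema-valid ALTO file.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Total" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="1,234" HPOS="400" VPOS="200" WIDTH="90" HEIGHT="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Mean" HPOS="100" VPOS="230" WIDTH="70" HEIGHT="20"/>
      <String CONTENT="61.7" HPOS="400" VPOS="230" WIDTH="75" HEIGHT="20"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining its String CONTENT values."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE))
```

The `HPOS` and `VPOS` attributes on each `String` give word positions, which is what makes ALTO useful for reconstructing columns or tables rather than just plain text.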
Chronicling America (Historic American Newspapers) provides bulk access to its OCR data. Each file decompresses into a directory structure that lets you easily map each OCR file to the URL identifier for that page.
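The path-to-URL mapping might look something like the sketch below. It assumes the extracted layout is `<lccn>/<yyyy>/<mm>/<dd>/ed-<n>/seq-<n>/ocr.txt` and that page URLs take the form `https://chroniclingamerica.loc.gov/lccn/<lccn>/<yyyy-mm-dd>/ed-<n>/seq-<n>/`. Both are assumptions to check against the site's current bulk-data documentation, and the function name and example path are purely illustrative.

```python
from pathlib import PurePosixPath

# Assumed base for page URLs; verify against the site's documentation.
BASE = "https://chroniclingamerica.loc.gov/lccn"


def page_url(ocr_path):
    """Map an extracted OCR file path to its (assumed) page URL."""
    parts = PurePosixPath(ocr_path).parts
    # The six directories before the file name: lccn, yyyy, mm, dd, ed-N, seq-N.
    lccn, year, month, day, edition, seq = parts[-7:-1]
    return f"{BASE}/{lccn}/{year}-{month}-{day}/{edition}/{seq}/"


print(page_url("sn83030214/1900/01/05/ed-1/seq-1/ocr.txt"))
```

Taking the components from the end of the path means the function also works when the archive was extracted under some other parent directory.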