
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification

Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (2019)

Abstract

When converting historical lexica into electronic form, the goal is not only to obtain a high-quality OCR result for the text but also to recognize typographical attributes automatically and precisely in order to capture the logical structure. For that purpose, we present a method that enables fine-grained typography classification by training an open-source OCR engine on both traditional OCR and typography recognition, and we show how to map the obtained typography information to the OCR-recognized text output. As a test case, we used a 19th-century German dictionary (Sanders' "Wörterbuch der Deutschen Sprache"), in which typography carries a particularly complex semantic function. Despite the very challenging material, we achieved a character error rate below 0.4% and a typography recognition that assigns the correct label to close to 99% of the words. In contrast to many existing methods, our novel approach works with real historical data and can handle frequent typography changes even within lines.
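
The two figures quoted in the abstract, character error rate and word-level typography accuracy, are conventionally computed as sketched below. This is a minimal illustration and not the evaluation code used in the paper: the function names are hypothetical, the character error rate is assumed to be the Levenshtein distance divided by the length of the ground-truth text, and word-level accuracy is assumed to be the share of words whose predicted typography label equals the ground-truth label.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the ground-truth length (assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def word_label_accuracy(reference_labels: Sequence[str],
                        predicted_labels: Sequence[str]) -> float:
    """Fraction of words whose predicted typography label matches the ground truth.

    Assumes both sequences are already aligned word by word.
    """
    if len(reference_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not reference_labels:
        raise ValueError("label sequences must not be empty")
    correct = sum(r == p for r, p in zip(reference_labels, predicted_labels))
    return correct / len(reference_labels)
```

For example, `character_error_rate("Wörterbuch", "Worterbuch")` is 0.1: one substitution over ten ground-truth characters. A character error rate below 0.4% therefore corresponds to fewer than four character errors per thousand ground-truth characters.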
