@benedikt.budig

Glyph Miner: A System for Efficiently Extracting Glyphs from Early Prints in the Context of OCR.

Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 31--34. ACM, (2016)

Abstract

While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints poses a significant challenge. To achieve good recognition quality, existing software must be "trained" specifically for each particular corpus. This is a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: Given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Würzburg.
