Case Study of a highly automated Layout Analysis and OCR of an incunabulum: ‘Der Heiligen Leben’ (1488)
C. Reul, M. Dittrich, and M. Gruner. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, (2017)
Abstract
This paper provides the first thorough documentation of a high-quality
digitization process applied to an early printed book from
the incunabulum period (1450-1500). The entire OCR-related
workflow, including preprocessing, layout analysis, and text
recognition, is illustrated in detail using the example of ‘Der
Heiligen Leben’, printed in Nuremberg in 1488. For each step, the
required time expenditure was recorded. The character recognition
yielded excellent results at both the character (97.57%) and word
(92.19%) levels. Furthermore, a highly automated (LAREX) and a
manual (Aletheia) method for layout analysis were compared.
By considerably automating the segmentation, the
required human effort was reduced significantly, from over 100
hours to less than six hours, with only a slight drop in OCR
accuracy. Realistic estimates of the human effort necessary for full-text
extraction from incunabula can be derived from this study. The
printed pages of the complete work, together with the OCR results, are
available online, ready to be inspected and downloaded.
%0 Journal Article
%1 reul2017study
%A Reul, Christian
%A Dittrich, Marco
%A Gruner, Martin
%D 2017
%I ACM
%J Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
%K early_printed_books incunabula myown optical_character_recognition segmentation
%T Case Study of a highly automated Layout Analysis and OCR of an incunabulum: ‘Der Heiligen Leben’ (1488)
%U https://dl.acm.org/citation.cfm?id=3078098
%X This paper provides the first thorough documentation of a high-quality
digitization process applied to an early printed book from
the incunabulum period (1450-1500). The entire OCR-related
workflow, including preprocessing, layout analysis, and text
recognition, is illustrated in detail using the example of ‘Der
Heiligen Leben’, printed in Nuremberg in 1488. For each step, the
required time expenditure was recorded. The character recognition
yielded excellent results at both the character (97.57%) and word
(92.19%) levels. Furthermore, a highly automated (LAREX) and a
manual (Aletheia) method for layout analysis were compared.
By considerably automating the segmentation, the
required human effort was reduced significantly, from over 100
hours to less than six hours, with only a slight drop in OCR
accuracy. Realistic estimates of the human effort necessary for full-text
extraction from incunabula can be derived from this study. The
printed pages of the complete work, together with the OCR results, are
available online, ready to be inspected and downloaded.
@article{reul2017study,
abstract = {This paper provides the first thorough documentation of a high-quality
digitization process applied to an early printed book from
the incunabulum period (1450-1500). The entire OCR-related
workflow, including preprocessing, layout analysis, and text
recognition, is illustrated in detail using the example of ‘Der
Heiligen Leben’, printed in Nuremberg in 1488. For each step, the
required time expenditure was recorded. The character recognition
yielded excellent results at both the character (97.57%) and word
(92.19%) levels. Furthermore, a highly automated (LAREX) and a
manual (Aletheia) method for layout analysis were compared.
By considerably automating the segmentation, the
required human effort was reduced significantly, from over 100
hours to less than six hours, with only a slight drop in OCR
accuracy. Realistic estimates of the human effort necessary for full-text
extraction from incunabula can be derived from this study. The
printed pages of the complete work, together with the OCR results, are
available online, ready to be inspected and downloaded.},
added-at = {2017-01-25T10:56:46.000+0100},
author = {Reul, Christian and Dittrich, Marco and Gruner, Martin},
biburl = {https://www.bibsonomy.org/bibtex/28c3a71ae1715b6137d5654c143726d7f/chreul},
interhash = {cf8a493216546087a5eef11a93de7d0f},
intrahash = {8c3a71ae1715b6137d5654c143726d7f},
journal = {Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage},
keywords = {early_printed_books incunabula myown optical_character_recognition segmentation},
publisher = {ACM},
timestamp = {2017-11-22T20:17:45.000+0100},
title = {Case Study of a highly automated Layout Analysis and OCR of an incunabulum: ‘Der Heiligen Leben’ (1488)},
url = {https://dl.acm.org/citation.cfm?id=3078098},
year = 2017
}