Large quantities of historical newspapers are being digitized and processed with OCR. We describe a framework for processing the OCR'd text to identify articles and extract their metadata. We present the article schema and give examples of features that facilitate automatic indexing of the articles. For this processing, we employ lexical semantics, structural models, and community content. We also describe visualization and summarization techniques that can be used to present the extracted events.