Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. For more information about Tika, please see the list of supported document formats and the available documentation. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
Tika is a subproject of Apache Lucene. Lucene is a project of the Apache Software Foundation.
Text mining and web scraping involve chunk parsing and recognition of named entities (institutions, dates, titles)... The extraction of named entities is mostly based on a strategy that combines look up in gazetteers (lists of companies, cities, etc.) wit
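The gazetteer look-up mentioned above can be illustrated with a minimal sketch. The excerpt is cut off before it names the second half of the combined strategy, so only the look-up step is shown. The gazetteer contents, the labels `COMPANY` and `CITY`, and the function names `build_index` and `tag_entities` are all hypothetical, chosen for illustration; the greedy longest-match policy is an assumption, not something the source specifies.

```python
import re

# Hypothetical gazetteers: tiny illustrative lists, not a real resource.
GAZETTEERS = {
    "COMPANY": ["Apache Software Foundation", "Springer"],
    "CITY": ["Gothenburg", "New York", "Berlin"],
}


def build_index(gazetteers):
    """Map each gazetteer entry, as a tuple of lowercased tokens, to its label."""
    index = {}
    for label, names in gazetteers.items():
        for name in names:
            index[tuple(name.lower().split())] = label
    return index


def tag_entities(text, gazetteers=GAZETTEERS):
    """Return (surface form, label) pairs found by greedy longest-match look-up."""
    index = build_index(gazetteers)
    max_len = max(len(key) for key in index)
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                found.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return found


if __name__ == "__main__":
    sentence = "The meeting in New York was hosted by the Apache Software Foundation."
    print(tag_entities(sentence))
```

Pure look-up of this kind misses names that are absent from the lists and cannot resolve ambiguous strings, which is presumably why the quoted description speaks of combining it with a further component.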
J. Illig, B. Roth, and D. Klakow. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 100–105. Gothenburg, Sweden, Association for Computational Linguistics, (April 2014)
G. Sautter and K. Böhm. Proceedings of the Second International Conference on Theory and Practice of Digital Libraries, pages 370–382. Berlin/Heidelberg, Springer, (2012)
M. Romanello, M. Berti, A. Babeu, and G. Crane. HT '09: Proceedings of the Twentieth ACM Conference on Hypertext and Hypermedia, New York, NY, USA, ACM, (July 2009)