Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
G. Manku, A. Jain, and A. Sarma. WWW '07: Proceedings of the 16th international conference on World Wide Web, pp. 141--150. New York, NY, USA, ACM, (2007)
H. Bast, A. Chitea, F. Suchanek, and I. Weber. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 671--678. ACM, (2007)
J. Hosken. Papers and discussions presented at the November 7-9, 1955, eastern joint AIEE-IRE computer conference: Computers in business and industrial systems, pp. 39--55. New York, NY, USA, ACM, (1955)
J. Ducrou, B. Vormbrock, and P. Eklund. Proceedings of the 14th International Conference on Conceptual Structures (ICCS 2006), volume 4068 of Lecture Notes in Computer Science, pp. 203--214. Springer, (2006)