What would be a good way to extract headlines, dates, and authors from news articles? It seems easy to write a scraper using XPath or similar to extract this information from a single site, but I'm not sure of a more scalable solution if you're extracting from, say, 10,000 sites.
TeSSI® (Terminology Supported Semantic Indexing) is a state-of-the-art tool that improves upon the existing search and retrieval tools by extracting the meaning out of medical free text and placing the resulting medical ‘concepts’ in the document...
Although term extraction has been researched for more than 20 years, only a few studies focus on under-resourced languages. Moreover, bilingual term mapping from comparable corpora for these languages has attracted researchers only recently. This paper presents methods for term extraction, term tagging in documents, and bilingual term mapping from comparable corpora for four under-resourced languages: Croatian, Latvian, Lithuanian, and Romanian. The methods described in this paper are language-independent as long as the user provides language-specific parameter data and has access to a part-of-speech or morpho-syntactic tagger.
Text mining and web scraping involve chunk parsing and recognition of named entities (institutions, dates, titles)... The extraction of named entities is mostly based on a strategy that combines lookup in gazetteers (lists of companies, cities, etc.) wit
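As a rough illustration of the gazetteer lookup mentioned above, here is a minimal Python sketch. The gazetteer entries, the label names, and the `tag_entities` helper are all hypothetical, and the sketch covers only the lookup step; whatever the excerpt combines it with is cut off and is not reproduced here.

```python
import re

# Hypothetical gazetteer: entity label -> known names (illustrative entries only).
GAZETTEER = {
    "CITY": ["New York", "Sydney", "Berlin"],
    "INSTITUTION": ["Carnegie Mellon University"],
}


def build_index(gazetteer):
    """Map each lowercased gazetteer name to its entity label."""
    index = {}
    for label, names in gazetteer.items():
        for name in names:
            index[name.lower()] = label
    return index


def tag_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, label) for each gazetteer match."""
    index = build_index(gazetteer)
    # Put longer names first so the alternation prefers them over shorter ones.
    names = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), index[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for match in tag_entities("She moved from Berlin to New York."):
        print(match)
```

A plain lookup like this only finds names that are already in the lists, which is presumably why it is described as one part of a combined strategy.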
This is the project page for SecondString, an open-source Java-based package of approximate string-matching techniques. This code was developed by researchers at Carnegie Mellon University from the Center for Automated Learning and Discovery, the Department of Statistics, and the Center for Computer and Communications Security.
SecondString is intended primarily for researchers in information integration and other scientists. It does or will include a range of string-matching methods from a variety of communities, including statistics, artificial intelligence, information retrieval, and databases. It also includes tools for systematically evaluating performance on test data. It is not designed for use on very large data sets.
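To give a sense of what approximate string matching means in practice, here is a small Python sketch of Levenshtein edit distance, one well-known measure of this kind. It is not SecondString code and does not mirror its Java API; the function names `levenshtein` and `similarity` are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to a score in [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("Jon Smith", "John Smith"))
```

The dynamic-programming table costs time proportional to the product of the two string lengths, so comparing every pair of records this way becomes expensive quickly on large collections.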
The main task of the GenIELex project is the development of a biochemistry-specific lexicon as well as of an annotated corpus for the evaluation of the system. The need for the construction of such a lexicon is illustrated by the following figures, based
J. Wermter, and U. Hahn. 44th Annual Meeting of the Association for Computational Linguistics, page 785--792. Sydney, Australia, Association for Computational Linguistics, (July 2006)
R. Mihalcea, and A. Csomai. Proceedings of the sixteenth ACM Conference on information and knowledge management, page 233--242. New York, NY, USA, ACM, (2007)
M. Romanello, M. Berti, A. Babeu, and G. Crane. HT '09: Proceedings of the Twentieth ACM Conference on Hypertext and Hypermedia, New York, NY, USA, ACM, (July 2009)
S. Auer, and J. Lehmann. ESWC '07: Proceedings of the 4th European conference on The Semantic Web, page 503--517. Berlin, Heidelberg, Springer-Verlag, (2007)
S. Auer, and J. Lehmann. The Semantic Web: Research and Applications, 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, volume 4519 of Lecture Notes in Computer Science, Springer, Berlin, (2007)
T. Tezuka, R. Lee, Y. Kambayashi, and H. Takakura. Proceedings of the Second International Conference on Web Information Systems Engineering, 2, page 14--21. (December 2001)