Our goal is to develop a probabilistic knowledge base that mirrors the content of the web. We are developing a system that uses semi-supervised learning methods to learn to extract symbolic knowledge from unstructured text and HTML. We are exploring methods of continous learning, where our system runs 24x7, continuously learning to read better, and continuously extracting facts from the web.
Webstemmer is a web crawler and HTML layout analyzer that automatically extracts main text of a news site without having banners, ads and/or navigation links mixed up
Semantic MediaWiki (SMW) is a free extension of MediaWiki that helps to search, organise, tag, browse, evaluate, and share the wiki's content. While traditional wikis contain only texts which computers can neither understand nor evaluate, SMW adds semantic annotations that bring the power of the Semantic Web to the wiki.
MuNPEx is a multi-lingual noun phrase (NP) extraction component developed for the GATE architecture, implemented in JAPE. It currently supports English, German, French, and Spanish (in beta).
MuNPEx requires a part-of-speech (POS) tagger to work and can additionally use detected named entities (NEs) to improve chunking performance. Please read the documentation (or source code) for more details.
E. Riloff, C. Schafer, and D. Yarowsky. Proceedings of the 19th international conference on Computational linguistics, page 1--7. Morristown, NJ, USA, Association for Computational Linguistics, (2002)