The BioScope corpus consists of medical and biological texts annotated for negation, speculation and their linguistic scope. This was done to allow a comparison between the development of systems for negation/hedge detection and scope resolution. The corpus is publicly available for research purposes.
After analyzing a large number of social annotations, we found that tags are usually semantically related to each other if they are frequently used to tag the same or related resources. Users may have similar interests if their annotations share many…
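The tag-relatedness idea above can be sketched with a small example: represent each tag as a count vector over the resources it was applied to, and compare tags by cosine similarity. The annotation triples, the `tag_vectors` helper, and the similarity choice are all illustrative assumptions, not the method of any particular system.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample annotations: (user, tag, resource) triples.
annotations = [
    ("alice", "nlp", "paper1"), ("alice", "text-mining", "paper1"),
    ("bob", "nlp", "paper2"), ("bob", "text-mining", "paper2"),
    ("carol", "cooking", "blog1"),
]

def tag_vectors(triples):
    """Map each tag to a bag of the resources it was applied to."""
    vecs = defaultdict(lambda: defaultdict(int))
    for _, tag, resource in triples:
        vecs[tag][resource] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

vecs = tag_vectors(annotations)
print(cosine(vecs["nlp"], vecs["text-mining"]))  # tags always applied together
print(cosine(vecs["nlp"], vecs["cooking"]))      # tags that never share a resource
```

The same vector construction works for users (user as row, tag as column), which is how shared annotations would indicate similar interests.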
The semantic web must "explain the meaning of words" to computers. Some semantic technologies use a "bottom up" approach, embedding semantic annotations (metadata) into web content. "Top down" technologies analyze information without metadata using some form of…
Extraction of structured knowledge from ancient sources for classical studies (eAQUA)
Funding program "Interactions between the Natural Sciences and the Humanities"
Every text search solution is only as powerful as the text analysis capabilities it offers. Lucene is one such open-source information retrieval library, offering many text analysis possibilities. In this post, we will cover some of the main text analysis features ElasticSearch makes available to enrich your search content.
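To make "text analysis" concrete, here is a minimal sketch of what a Lucene-style analysis chain does to incoming text: tokenize, lowercase, and drop stop words. The regex tokenizer and the stop-word list are simplified stand-ins, not ElasticSearch's actual implementation.

```python
import re

# Illustrative stop-word subset; real analyzers ship much larger lists.
STOPWORDS = {"the", "is", "a", "of", "to"}

def analyze(text):
    """Sketch of an analysis chain: tokenize -> lowercase -> stop-word removal."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("The Power of Text Analysis"))  # ['power', 'text', 'analysis']
```

Search engines index the output tokens rather than the raw text, which is why the choice of analyzer directly shapes what queries can match.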
eTBLAST is a unique search engine for biomedical literature: it lets you input an entire paragraph and returns MEDLINE abstracts that are similar to it.
FullText.exe is freely available for academic usage. The program generates a word-occurrence matrix, a co-occurrence matrix, and a normalized co-occurrence matrix from a set of text files and a word list.
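The three matrices mentioned above can be illustrated with a small sketch. The documents, the word list, and the normalization scheme (dividing each row by its diagonal, i.e. the row word's document frequency) are assumptions for illustration; FullText.exe's exact definitions may differ.

```python
# Hypothetical inputs: a word list and a few tiny "files" as strings.
word_list = ["cell", "gene", "protein"]
documents = [
    "gene expression in the cell",
    "protein binds the gene",
    "cell membrane protein",
]

# Word-occurrence matrix: rows = documents, columns = words (raw counts).
occurrence = [[doc.split().count(w) for w in word_list] for doc in documents]

# Co-occurrence matrix: cooc[i][j] = number of documents containing both words.
n = len(word_list)
cooc = [[sum(1 for row in occurrence if row[i] and row[j]) for j in range(n)]
        for i in range(n)]

# Normalized co-occurrence (one possible scheme): divide each entry by the
# diagonal, so row i approximates P(word_j appears | word_i appears).
norm = [[cooc[i][j] / cooc[i][i] if cooc[i][i] else 0.0 for j in range(n)]
        for i in range(n)]

print(cooc)
print(norm)
```

With real corpora the word list keeps the matrices small and focused on terms of interest rather than the full vocabulary.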
This is an overview of open-source NLP and machine learning tools for text mining, information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more.
Natural Language Corpus Data: Beautiful Data
This directory contains code and data to accompany the chapter Natural Language Corpus Data from the book Beautiful Data (Segaran and Hammerbacher, 2009). If you like this you may also like: How to Write a Spelling Corrector.
Welcome to NewsReader: “Building structured event Indexes of large volumes of financial and economic Data for Decision Making”
The volume of news data is enormous and expanding, covering billions of archived documents with millions of documents added daily. These documents are also getting more and more interconnected with knowledge from other sources such as biographies and company databases.
Professional decision makers who need to respond quickly to new developments, or who need to explain these developments on the basis of the past, face the problem that current solutions for consulting these archives no longer work. There are simply too many possibly relevant and partially overlapping documents, and from these documents decision makers must still distinguish the correct from the wrong, the new from the old, and the current from the out-of-date by reading the content and maintaining a record in memory. Consequently, it becomes almost impossible to make well-informed decisions, and professionals risk being held liable for decisions based on incomplete, inaccurate and out-of-date information.
NewsReader will process news in four different languages as it comes in. It will extract what happened to whom, when and where, removing duplication, complementing information, registering inconsistencies and keeping track of the original sources. Any new information is integrated with the past, distinguishing the new from the old in an unfolding story line, similar to how people tend to remember the past and access knowledge and information. The difference is that NewsReader can provide access to all original sources and will not forget any details (like a "History Recorder"). We will develop a decision-support tool that allows professional decision makers to explore these story lines using visual interfaces and interactions, exploiting their explanatory power and their systematic structural implications. Likewise, NewsReader can make predictions about future events from the past, or explain new events and developments through the past.
* Linking Biomedical Information Through Text Mining * Semantic Webs for Life Sciences * Computational Approaches for Pharmacogenomics * Computational Proteomics
A quick tutorial for the Boston Predictive Analytics MeetUp to demonstrate the use of R in the context of text mining Twitter. We implement a very crude algorithm…
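The tutorial itself uses R, but the "very crude" style of algorithm such walkthroughs typically implement can be sketched in a few lines of Python: score each tweet by counting hits against positive and negative word lists. The word lists and the scoring rule here are illustrative assumptions, not the tutorial's actual code.

```python
# Hypothetical word lists; real tutorials often load published opinion lexicons.
POSITIVE = {"great", "good", "love", "awesome"}
NEGATIVE = {"bad", "hate", "awful", "broken"}

def score(tweet):
    """Crude sentiment score: positive-word hits minus negative-word hits."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score("I love this great phone"))    # 2
print(score("awful battery, bad screen"))  # -2
```

Crude as it is, this kind of baseline is a common starting point before moving to tokenization, negation handling, or trained classifiers.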
Text Mining, Recommendation Systems / Collaborative Filtering, Structure of the Web Graph, PageRank / Spam, Social Networking, Data Structures, Bloom Filters ... Stanford University course; resources, links, more.
Research Interests Comparator (RIC) is our fourth electronic text mining project. The goal of the RIC system is to dramatically improve the ability of biomedical researchers to find information that is relevant to their areas of study, and to provide them…
The Software Environment for the Advancement of Scholarly Research (SEASR), funded by the Andrew W. Mellon Foundation, provides a research and development environment capable of powering leading-edge digital humanities initiatives.
Text mining and web scraping involve chunk parsing and recognition of named entities (institutions, dates, titles)... The extraction of named entities is mostly based on a strategy that combines lookup in gazetteers (lists of companies, cities, etc.) with…
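The gazetteer-lookup side of that strategy can be sketched as a longest-match scan of the text against per-type name lists. The tiny gazetteers and the single-match simplification below are assumptions; real systems combine much larger lists with contextual rules and handle repeated or overlapping matches.

```python
# Toy gazetteers keyed by entity type.
GAZETTEERS = {
    "ORG": {"Google", "Stanford University"},
    "LOC": {"Boston", "Paris"},
}

def gazetteer_ner(text):
    """Find the first occurrence of each gazetteer entry, longest names first."""
    entities = []
    for label, names in GAZETTEERS.items():
        for name in sorted(names, key=len, reverse=True):
            start = text.find(name)
            if start != -1:
                entities.append((name, label, start))
    return sorted(entities, key=lambda e: e[2])

print(gazetteer_ner("Researchers at Stanford University visited Boston."))
```

Trying longer names first matters: otherwise a list containing both "Stanford" and "Stanford University" could tag only the shorter span.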
TAPoR will build a unique human and computing infrastructure for text analysis across the country by establishing six regional centers to form one national text analysis research portal.
Using the transcripts of Bill Gates' keynote from CES 2007 and Steve Jobs' keynote at Macworld 2007 (via Todd Bishop's Microsoft Blog) I created this relational tagcloud using Rhizome Navigation.
W. Zhao, J. Jiang, J. He, Y. Song, P. Achananuparp, E. Lim, and X. Li. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 379-388. Stroudsburg, PA, USA, Association for Computational Linguistics, 2011.