consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. includes a search across text-archives.
Wired Magazine issue 16.07. Data Deluge. Crop predictions. Quark. Data mining. tracking news. watching the skies, scanning skeletons. airfares. voting. epidemics. google events. terrorism. visualizing big data
We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, index and search them. We also
PIE incorporates a database derived from the second or World Edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface for studying words and phrases up to eight words long appropriate for both experienced researchers and novice users.
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
Online repository of large data sets for researchers in knowledge discovery and data mining. includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, Web Data (web pages and log files).