a corpus of dialect speech from the Tyneside area of North-East England. DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades. The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
the Google Books corpus of American English, 155 billion words in size. Access is limited to what you can do via the website at Brigham Young University. The easiest thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface also lets you look for collocates (words that tend to occur alongside other words), view charts of relative word frequency by decade, search by part of speech, and set various limits and display options. Other kinds of analysis that might be done with text corpora can’t be done through the interface.
The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Hurricane Katrina disaster. We also collect pages from government Web sites on a regular basis.
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. LDC's Catalog contains hundreds of corpora of language data including Santa Barbara Corpus of Spoken American