a corpus of dialect speech from the Tyneside area of North-East England. DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades. The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
Developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology, and coreference).
a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain, using the Yahoo! and Google search engines.
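A minimal sketch of how such queries could be generated (an illustration, not the collectors' actual script; the wordlist path and the PDF file type are assumptions):

    import random

    WORDLIST = "/usr/share/dict/words"   # the "Unix dictionary"; the path varies by system

    with open(WORDLIST) as f:
        words = [w.strip() for w in f if w.strip()]

    def random_query() -> str:
        """Build one query of the three kinds described above."""
        kind = random.choice(["word", "number", "combination"])
        if kind == "word":
            term = random.choice(words)
        elif kind == "number":
            term = str(random.randint(1, 1_000_000))
        else:
            term = f"{random.choice(words)} {random.randint(1, 1_000_000)}"
        # Restrict hits to web servers in the .gov domain and to one file type (here PDF).
        return f"{term} site:.gov filetype:pdf"

    for _ in range(5):
        print(random_query())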
LAUDATIO aims to build an open-access research data repository for historical linguistic data that meets the requirements of historical corpus linguistics. For the access and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible and appropriate documentation schema based on a subset of TEI customized with TEI ODD.
The University of Glasgow took over the distribution of the WT2g/WT10g/.GOV/.GOV2 Web Research Collections from CSIRO (Commonwealth Scientific and Industrial Research Organisation), which had been distributing the Web Research collections to organizations and individuals engaged in research and development of natural language processing, information retrieval or document understanding systems, strictly for research purposes. These collections have been used in the TREC Web & Terabyte tracks. In addition, as part of the TREC Blog track, the University of Glasgow is currently distributing the Blogs06 & Blogs08 test collections. The University of Glasgow also handles access to these test collections (including .GOV, .GOV2, Blogs06, and Blogs08).
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. LDC's Catalog contains hundreds of corpora of language data, including the Santa Barbara Corpus of Spoken American English.
consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
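The per-person figures follow directly from the totals; a quick check of the arithmetic:

    bloggers = 19_320
    posts = 681_288
    words = 140_000_000              # "over 140 million words"

    print(round(posts / bloggers))   # ~35 posts per person
    print(round(words / bloggers))   # ~7246 words per person, i.e. roughly 7250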
an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. It includes a search across text archives.
Wired Magazine issue 16.07, "Data Deluge": crop predictions, quark, data mining, tracking news, watching the skies, scanning skeletons, airfares, voting, epidemics, Google events, terrorism, visualizing big data.
We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, and index and search them.
PIE incorporates a database derived from the second ("World") edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface for studying words and phrases up to eight words long, appropriate for both experienced researchers and novice users.
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
Online repository of large data sets for researchers in knowledge discovery and data mining. It includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
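As an illustration of that kind of use (not part of the data set itself), scikit-learn ships a loader for 20 Newsgroups, so a baseline text classifier can be trained in a few lines; the feature and model choices below are just one common setup:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score

    # Load the standard train/test split, stripping metadata that makes the task too easy.
    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

    # TF-IDF features + multinomial naive Bayes: a common text-classification baseline.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train.data, train.target)

    print("accuracy:", accuracy_score(test.target, model.predict(test.data)))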
The Moby lexicon project is complete and has been placed into the public domain. Use, sell, rework, excerpt and use in any way on any platform. 610,000+ words and phrases. The largest word list in the world and more.
The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Hurricane Katrina disaster. We also collect pages from government Web sites on a regular basis.
DigitalCorpora.org is a website of digital corpora for use in computer forensics education research. All of the disk images, memory dumps, and network packet captures available on this website are freely available and may be used without prior authorization or IRB approval. We also have available a research corpus of real data acquired from around the world.
OPUS is an attempt to collect translated texts from the web, to convert and align the entire collection, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products and is also delivered as an open source package. We used several tools to compile the current corpus.
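Many OPUS corpora can be downloaded as plain-text, line-aligned file pairs (the "Moses" format), where line N of the source file translates line N of the target file. A minimal sketch of reading such a pair, with hypothetical file names:

    from itertools import islice

    def read_parallel(src_path, tgt_path):
        """Yield (source, target) sentence pairs from two line-aligned files."""
        with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
            for s, t in zip(src, tgt):
                yield s.strip(), t.strip()

    # Print the first five sentence pairs of a hypothetical English-German corpus.
    for en, de in islice(read_parallel("corpus.en", "corpus.de"), 5):
        print(en, "|||", de)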
The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. We also give advice on the creation and use of these resources, and are involved in the development of standards and infrastructure for electronic language resources.
The new TRC2 corpus comprises 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14 (2,871,075,221 bytes). It is initially made available to participants of the blog track at the Text Retrieval Conference (TREC) to supplement the BLOG08 corpus (the results of a large blog crawl carried out at the University of Glasgow), which is the main corpus used at the TREC Blog Track.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
This site contains what we believe is the most accurate frequency data for English. It contains word frequency lists of the top 60,000 words (lemmas) in English, collocates lists (looking at nearby words to see word meaning and use), and n-grams (the frequency of all two- and three-word sequences in the corpora).
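The n-gram lists are simply frequency counts of contiguous word sequences; a toy sketch of the counting idea (the real lists are of course built from far larger corpora):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Count all contiguous n-word sequences in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    tokens = "the cat sat on the mat and the cat slept".split()

    print(ngram_counts(tokens, 2).most_common(3))   # top bigrams, e.g. (('the', 'cat'), 2)
    print(ngram_counts(tokens, 3).most_common(3))   # top trigrams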
The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. It was created by Mark Davies of Brigham Young University in 2008, and it is now used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created or modified, including the British National Corpus (our architecture and interface), the 100 million word TIME Corpus (1920s-2000s), and the new 400 million word Corpus of Historical American English (COHA; 1810-2009). The corpus contains more than 425 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2011, and the corpus is updated once or twice a year.
Corpex lets you swiftly browse through all the words of Wikipedia. The system shows you two statistics in four graphs. Corpex is also available as a RESTful web service. Corpex is still very much under development: the currently extracted data is still very noisy, and we are working on better extraction and filtering approaches. The source code is fully open source, and all the data is also freely available.
the Google Books corpus of American English, 155 billion words in size. Use is limited to what you can do via the website at Brigham Young University. The easy thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface allows you to look for collocates (words that go with other words) and view charts showing relative word frequency in the corpus by decade; it handles parts of speech and gives you various limits and display options. Other kinds of analysis that might be done with text corpora can’t be done through the interface.