The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
Developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern Californias Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
the Google Books corpus of American English, 155 billion words in size. limited to what you can do via the website at Brigham Young University. The easy thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface allows you to look for collocates (words that go with other words), view charts showing relative word frequency in the corpus by decade, handles parts of speech, and gives you various limits and display options. Other kinds of analysis that might be done with text corpora can’t be done through the interface.
[accès réservé] Corpus de textes français, du XVIe au XXe s., plus de 3700 œuvres dont 80% de textes littéraires et 20% de textes techniques. Pas d’accès au texte intégral des œuvres