Twitter corpus for sentiment analysis from a class (CS224N) at Stanford.
Class page:
https://sites.google.com/site/twittersentimenthelp/for-researchers#Where_is_the_Tweet_corpus_8553
http://www.stanford.edu/~alecmgo/cs224n
Corpex lets you swiftly browse through all the words of Wikipedia. The system shows you two statistics in four graphs. Corpex is also available as a RESTful web service. It is still very much under development: the currently extracted data is very noisy, and we are working on better extraction and filtering approaches. The source code is fully open source, and all the data is freely available.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
An international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Includes a search across text archives.