Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
The Moby lexicon project is complete and has been placed into the public domain. Use, sell, rework, or excerpt it in any way on any platform. 610,000+ words and phrases. Billed as the largest word list in the world, and more.
The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Hurricane Katrina disaster. We also collect pages from government Web sites on a regular basis.