Supported by Google Ideas, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
The UK Reading Experience Database (UK RED) is an open access database and research project housed in the English Department of the Open University. It is the largest resource of its kind anywhere for recording the experiences of readers. UK RED has amassed over 30,000 records of reading experiences of British subjects, both at home and abroad, and of visitors to the British Isles, between 1450 and 1945. These include both famous and anonymous readers. The database is open access and also accepts unsolicited public contributions.
HUD USER provides interested researchers with access to the original electronic data sets generated by PD&R-sponsored data collection efforts, including the American Housing Survey and HUD median family income limits, as well as microdata from research initiatives on topics such as housing discrimination, the HUD-insured multifamily housing stock, and the public housing population.
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).
Erik Gartzke, Associate Professor, Political Science, University of California, San Diego. Links to data: United Nations General Assembly Voting Data, The Affinity of Nations: Similarity of State Voting Positions in the UNGA, Disaggregated Military Expenditure, Nuclear Production Capabilities, Intergovernmental Organization, "The Capitalist Peace" Replication Data.
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Various US databases provided by federal government agencies, covering Census, Labor Statistics, Transportation, and Economics. Also: A 3D Version of the PubChem Library, Annotated Human Genome Data.
YQL (Yahoo Query Language) works with arbitrary structured (XML or JSON) documents with repeating elements, such as a list of restaurants or search results. Different "known" collections of these items are presented as "tables" in the YQL syntax, and are notionally namespaced based on the service providing the data.
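As a rough illustration of the "tables" idea, the sketch below composes a YQL-style SELECT statement as a plain string and URL-encodes it. The table name `local.search`, the field names `zip` and `query`, the request parameters `q` and `format`, and both helper functions are assumptions chosen for illustration, not taken from the description above. No network call is made, and quoting rules for values that contain quote characters are deliberately left out.

```python
from urllib.parse import urlencode


def build_yql_query(table, fields=("*",), filters=None):
    """Compose a YQL-style SELECT statement as a plain string.

    `table` is a namespaced collection name, e.g. "<service>.<collection>".
    This helper is hypothetical and only illustrates the syntax shape.
    """
    statement = f"select {', '.join(fields)} from {table}"
    if filters:
        clauses = []
        for key, value in filters.items():
            text = str(value)
            # Escaping rules are not covered here, so refuse such values.
            if "'" in text:
                raise ValueError("values containing single quotes are not supported")
            clauses.append(f"{key}='{text}'")
        statement += " where " + " and ".join(clauses)
    return statement


def encode_request(statement, fmt="json"):
    """URL-encode the statement as a query string (parameter names assumed)."""
    return urlencode({"q": statement, "format": fmt})


# Illustrative table and field names (assumptions, not from the text above).
query = build_yql_query("local.search", filters={"zip": "94085", "query": "pizza"})
encoded = encode_request(query)
print(query)
print(encoded)
```

The dotted prefix of the table name plays the role of the namespace: everything before the last dot names the service providing the data, and the last component names the collection of repeating items.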
StatLib is a system for distributing statistical software, datasets, and information. It was started in 1989 and is hosted by the Department of Statistics at Carnegie Mellon University.