Supported by Google Ideas, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
The UK Reading Experience Database (UK RED) is an open access database and research project housed in the English Department of the Open University. It is the largest resource of its kind recording the experiences of readers. UK RED has amassed over 30,000 records of reading experiences of British subjects, both at home and abroad, and of visitors to the British Isles, between 1450 and 1945. These include both famous and anonymous readers. It is both an open access resource and open to unsolicited public contributions.
HUD USER provides interested researchers with access to the original electronic data sets generated by PD&R-sponsored data collection efforts, including the American Housing Survey and HUD median family income limits, as well as microdata from research initiatives on topics such as housing discrimination, the HUD-insured multifamily housing stock, and the public housing population.
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes discrete sequence data, image data, multivariate data, relational data, spatio-temporal data, text (corpora), time series, and web data (web pages and log files).
Erik Gartzke, Associate Professor, Political Science, University of California, San Diego. Links to data: United Nations General Assembly Voting Data; The Affinity of Nations: Similarity of State Voting Positions in the UNGA; Disaggregated Military Expenditure; Nuclear Production Capabilities; Intergovernmental Organization; "The Capitalist Peace" Replication Data.
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Various US databases provided by federal government agencies. Census, Labor Statistics, Transportation, Economics. Also: A 3D Version of the PubChem Library, Annotated Human Genome Data.
YQL (Yahoo Query Language) works with arbitrary structured (XML or JSON) documents with repeating elements, such as a list of restaurants or search results. Different "known" collections of these items are presented as "tables" in the YQL syntax, and are notionally namespaced based on the service providing the data.
StatLib is a system for distributing statistical software, datasets, and information. It started in 1989 and is hosted by the Department of Statistics at Carnegie Mellon University.
Collection of economic, social and environmental time series data from sources including the United Kingdom government, the Federal Reserve System and the European Central Bank. You can build graphs and embed them into your blogs and websites, and if the data they're based on is updated, they'll be updated too. You can also set up alerts and have Timetric email you when something interesting happens to a value you're watching. Also has an API.
The Moby lexicon project is complete and has been placed into the public domain. Use, sell, rework, excerpt and use in any way on any platform. 610,000+ words and phrases. The largest word list in the world and more.
LexisNexis™ Statistical DataSets is a new online service that enables researchers to build statistical tables and charts from multiple sources in a single interface. This online interactive statistical solution aggregates over 580 licensed and public domain datasets provided by 50 sources. The DataSets product makes 12.0 billion data points accessible within a single interface.
The data here is useful for testing classification and clustering, and the accuracy of indexing techniques. However, the datasets are too small to support claims about the efficiency of indexing.
These datasets were designed for experiments in "Finding Underlying Connections: A Fast Graph-Based Method for Link Analysis and Collaboration Queries".
CFDR aims to accelerate research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems. Users can contribute data and download data.
Let's say you are looking for locations for Retailer XYZ. Well, XYZ keeps all of their location information in a central database, usually housed on a server in a large datacenter. They do not just hand out dumps of this information to anyone who asks. But company XYZ knows that the customer needs to be able to find locations in their neighborhood, so XYZ puts all of this information on their own website, publicly available to the world. Usually the company will add some mechanism to limit the amount of location data a web visitor can find at once, such as a store locator, but our team has developed methods of working with these mechanisms to comprehensively iterate through the data source. Consequently, we provide a very complete and accurate resource for your needs, whether for personal or business use. Some data is free; some is available for a fee.
The Database contains over two hundred pieces of information about each case decided by the Court between the 1953 and 2008 terms. http://supremecourtdatabase.org/index.php
Center for Global Development data sets: Cross-Country Data on AIDS Treatment and HIV Prevalence in 2006-07, Owen McCarthy and Mead Over. The Fate of Young Democracies. Net Aid Transfers data set (1960-2007), David Roodman. New Data on African Health Professionals Abroad, Michael Clemens. Anarchy of Numbers data set, David Roodman. Aid, Policies, and Growth data set, William Easterly, Ross Levine and David Roodman.
The National Digital Archive of Datasets (NDAD) preserves and provides online access to archived digital datasets and documents from UK central government departments. Our collection spans 40 years of recent history, with the earliest available dataset dating back to about 1963.
GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota. Datasets include MovieLens, Wikilens, Book-Crossing, Jester Joke, and EachMovie.
A torrent tracker for public datasets. If you are a scientist, a research developer, or just interested in the data, you can find and download datasets here. If you own a dataset, you can publish it on this site by becoming a torrent seeder.
The Open Economics project provides open content, data and code related to Economics. This site itself provides interfaces to some (though not all) of the Open Economics datasets and models.