This page provides a large hyperlink graph for public download. The graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. Below we provide instructions on how to download the graph as well as basic statistics about its topology. ·
Researchers at Google annotated English-language web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic, and hence imperfect. However, the annotations are generally of high quality, because the researchers aimed for high precision (and, by necessity, lower recall). For each entity recognized with high confidence, the data gives the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).
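To make the shape of one annotation concrete, here is a minimal Python sketch. It is not the released file format: the class name, the field names, the exclusive end offset, the UTF-8 encoding, and the placeholder mids are all assumptions made for illustration. The point it shows is that the offsets count bytes, so a mention has to be sliced out of the raw bytes before decoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAnnotation:
    """One entity mention. Field names are hypothetical, not the released schema."""
    begin: int           # byte offset where the mention starts
    end: int             # byte offset where the mention ends (assumed exclusive)
    mid: str             # Freebase identifier
    confidence_a: float  # first confidence level
    confidence_b: float  # second confidence level, computed differently


def mention_text(page: bytes, ann: EntityAnnotation, encoding: str = "utf-8") -> str:
    """Slice the mention out of the raw page bytes, then decode it."""
    return page[ann.begin:ann.end].decode(encoding, errors="replace")


def high_confidence(annotations, threshold):
    """Keep annotations whose two confidence levels both reach the threshold."""
    return [a for a in annotations
            if min(a.confidence_a, a.confidence_b) >= threshold]


# Made-up page and annotations; the mids are placeholders, not real identifiers.
text = "Zürich is in Switzerland."
page = text.encode("utf-8")
annotations = [
    EntityAnnotation(0, 7, "/m/placeholder_city", 0.95, 0.90),
    EntityAnnotation(14, 25, "/m/placeholder_country", 0.99, 0.40),
]

for ann in annotations:
    print(ann.mid, mention_text(page, ann))
```

Because "ü" takes two bytes in UTF-8, slicing the decoded string with the same numbers (`text[14:25]`) would give "witzerland." instead of "Switzerland", which is why the sketch slices the bytes first.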
You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets. ·
Berlin is getting quieter ("Berlin wird leiser"): taking action against traffic noise. - The Berlin Senate Department for Urban Development and the Environment wants to involve its citizens in drawing up the noise action plan. All citizens can report where Berlin is too loud for them and which measures could help. On this basis, the city develops measures for making Berlin quieter. ·
A collection of 28 datasets containing audio features and metadata for a million contemporary popular music tracks.
The collection is a collaboration between LabROSA and The Echo Nest. More details, background, and instructions on how to use the datasets can be found at LabROSA’s site. The goal of sharing this data on Infochimps is to provide a large dataset for research and to encourage the development of algorithms that scale to data of this size.
There is one dataset for each letter of the alphabet (A-Z) containing data for all songs that start with that letter, one dataset of additional files, and a small sample dataset.
Each per-letter dataset consists of song files in the HDF5 format.
Most of the data is licensed under the same terms as the Echo Nest API. The code is released under the GNU General Public License. ·
Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians. ·
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere; that is, both important and spam tweets are included. ·